Introduction
1. The purpose of assessment
1.1 Testing general proficiency
1.2 Educational placement and diagnosis
1.3 Formative and summative assessment
1.4 Testing for special purposes
2. Establishing assessment criteria
2.1 The importance of validity
2.2 The components of language use
2.3 Specifying performance criteria
2.4 Global rating scales
2.5 Analytic rating scales
3. Choosing the best test format
3.1 Interview tasks
3.1.1 Structured interviews
3.1.2 Unstructured interviews
3.2 Role play tasks
3.2.1 Structured role plays (information gap)
3.2.2 Unstructured role plays
4. Special issues
4.1 Practicality
4.2 Bias for best
4.3 Marking
Conclusion
Bibliography
Introduction
There are many English language teachers working in EFL contexts overseas. Their work often requires the quick assessment
of a student's oral ability, usually during a brief initial interview or even in the very first class. This can help
determine the choice of class material and the overall aims of a course of instruction. Informal assessment also continues
throughout any teaching program, as a way of ensuring that desired outcomes are being achieved and students' needs are
being met.
Such informal assessment is clearly a central part of language teaching. It is no less important than the formal testing of
achievement, or the testing of employment and academic-related proficiency. It follows that all teachers in EFL contexts,
whatever their positions and duties, ought to have a basic understanding of the principles underlying assessment of oral
language skills.
1. The purpose of assessment
Before designing oral assessment tasks there needs to be a clear idea of the purpose of assessment. This is essential
because the same degree of detail is not required in every testing situation. The purpose of the test will determine
the overall shape of the assessment criteria to be used.
1.1 Testing general proficiency
The assessment of general proficiency is independent of a particular syllabus, and provides a broad view of a person's
language ability. Ideally it focuses on fundamental oral skills, as well as on common communicative functions. Tasks such
as summarising technical data, or describing statistics, require a grasp of fundamental skills of course, but they are
clearly limited to academic or employment settings. They are formal tasks requiring particular language and presentation
skills.
The term 'proficiency' refers to the practical use of language as a whole. It is therefore best assessed directly by
eliciting extended samples of interactive language use in realistic contexts. The indirect assessment of oral language,
through controlled response to single test items, has limited value as an indicator of real-life oral proficiency.
Unfortunately there is no such thing as a definitive test of general oral ability that can be applied in any situation.
The standard of 'native-like' proficiency is only a convenient abstraction - one that ignores the personal and cultural
differences that make communication real and complex. In EFL contexts, such as Japan, testees are often quite unfamiliar
with Western cultural references and modes of behaviour, and so the design of test items needs to be as culturally neutral
as possible without being too vague.
1.2 Educational placement and diagnosis
Assessment for educational placement is used to place the student in a suitable level for learning. It does not require
the same degree of detail as a general proficiency test, but involves matching a student against fairly broad criteria
in a band scale. Each 'band' describes the minimum level of ability needed for each stage of instruction. The most basic
band scale would consist of only three levels: beginner, intermediate, and advanced.
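As a rough illustration of placement against a three-level band scale, the sketch below maps an overall rating to the highest band whose minimum level it meets. The 0-10 rating range, the cut-off values and the names BANDS and place are assumptions made for this example only; they do not come from any published scale.

```python
# Hypothetical band scale: (minimum rating needed, band name), highest first.
# The 0-10 range and the cut-offs are illustrative assumptions.
BANDS = [(7.0, "advanced"), (4.0, "intermediate"), (0.0, "beginner")]


def place(rating):
    """Return the highest band whose minimum level the rating meets."""
    if not 0.0 <= rating <= 10.0:
        raise ValueError("rating must be between 0 and 10")
    for minimum, band in BANDS:
        if rating >= minimum:
            return band


print(place(5.5))  # prints "intermediate"
```

Because each band is defined only by its minimum level, a placement decision needs a single broad rating rather than the detailed profile a diagnostic assessment would require.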
Diagnostic assessment provides more detailed information on a learner's strengths and weaknesses. It requires descriptive
analysis, both by impressionistic description and by rating specific aspects of language use. Such information is valuable
for tailoring lessons more closely to learners' needs, and as a standard for evaluating progress at a later stage.
1.3 Formative and summative assessment
Formative assessment indicates a learner's ongoing progress during a course. It need not involve testing under formal
conditions, but may simply consist of various impressions and notes that the teacher takes while observing students
engage in communicative tasks. Summative assessment on the other hand is the formal measurement of a learner's achievement
at the end of a unit or course of instruction. This involves matching student achievement with the stated objectives of
the course. Summative assessment is different from general proficiency testing in that the assessment tasks of the former
are based on the representative sampling of a syllabus.
1.4 Testing for special purposes
Outside of a specific syllabus of instruction, many people sit self-contained language tests that are recognized by higher
education and employment bodies (e.g. TOEIC and TOEFL). The design of assessment tasks and the choice of language are intended
to reflect the skills and knowledge needed in special contexts of work or study.
These kinds of tests are used as high-grade filters that discriminate between learners and rank them against a sliding
scale. The main purpose of assessment here is to identify candidates for access to limited opportunities such as
scholarships and promotions. As such they are not well suited to assessing an individual's level of
proficiency in detail.
2. Establishing assessment criteria
According to the Australian Oxford Mini Dictionary a criterion is a "principle or standard by which (a) thing is judged."
To test oral language skills there need to be such criteria to act as guidelines for judgement. These should describe the
various levels of performance in a way that can be tested both logically and consistently. The last two points are often
called 'validity' and 'reliability' in the literature on assessment.
2.1 The importance of validity
Validity has been described as "the single most critical element in constructing foreign language tests"
(Nakamura 1995: 126). A valid test has a recognizable logic to it that makes the test a meaningful tool of assessment.
The most fundamental kind of validity relates to the underlying theory of language on which the test is constructed
(construct validity). This influences the sampling of language material and tasks (content validity), which in turn has an
effect on the appearance of the test to the teachers and learners who use it (face validity).
Construct validity requires a set of principles that can adequately describe real-life language use. In the case of oral
language skills this is not such a simple matter. Speaking may seem to be a general-purpose ability, but it occurs under
many contexts and conditions, and for many reasons. Each has its own characteristics and demands, especially when seen as
an interactive skill. In the last few decades a great deal of effort has been made to describe language use as an
interactive or communicative system. Canale and Swain's (1980) model of 'communicative competence' is certainly the best
known example in the literature on applied linguistics.
2.2 The components of language use
Grammar, vocabulary, and pronunciation all fall under the general category of grammatical (or linguistic) competence in
Canale and Swain's influential model. These are the basic skills, traditionally taught and tested in isolation from a
communicative context. Yet in order to predict real language use successfully, higher level skills and knowledge also need
to be considered.
A second category called discourse competence concerns the way language is conventionally shaped in different communicative
contexts. Describing a suspect during a police interview, for example, requires more than basic grammatical skills - it
involves selecting, organising and linking elements together to create a structured and coherent whole. Canale and Swain
distinguish a third category called sociocultural competence, which covers the cultural forms of speech deemed appropriate
in a particular community.
Weir (1993), drawing from Bygate, conveniently includes both discourse and sociocultural aspects of language use under the
single heading "routine skills". These are "frequently recurring ways of structuring speech, such as descriptions,
comparisons, instructions, telling stories", and include the patterns of interactional language use seen in such things as
"buying goods in a shop, or telephone conversations, interviews, meetings, discussions, decision making, etc"
(Weir 1993: 32).
Canale and Swain's fourth category is strategic competence, which covers the various techniques people use to manage and
enhance communication. This category is covered by Weir under the heading "improvisation skills" (1993: 32-4).
Communication is a faulty and chaotic process and speakers need to be able to improvise when their conventional language
routines fail. This includes both the "negotiation of meaning" in various ways to enhance understanding, as well as the
"management of interaction" to establish "who is going to speak next and what the topic is going to be" (turn taking and
topic initiation).
2.3 Specifying performance criteria
As the preceding section has shown, interactive oral skills involve different categories of practical knowledge
(or know-how), each one effectively building on the one before: first, basic grammatical & linguistic knowledge (core skills),
then discourse & sociocultural knowledge (routine skills), and finally strategic knowledge (improvisational skills).
Having this general framework is helpful in identifying the various components of oral ability that can be assessed.
Yet deciding what weighting to give each category of skill is still not a straightforward matter.
Improvisational skills are useful in every general context. For example, "Excuse me, what did you say?", or its equivalent,
is an essential phrase. In particular contexts, such as business negotiation, there is a greater need for highly developed
improvisational skills. In choosing or designing specific performance criteria for an oral test it is important to decide
which of these categories are important and to what extent at each level of a candidate's ability. Different criteria will
produce different results. As noted by Brown, "if each group were to develop its own assessment framework..., they may, in
fact, through the inclusion or weighting of specific criteria, produce schemes which lead to quite different evaluations of
candidates ability" (cited in Turner 1998: 198). The assessment criteria need to be related to the actual purpose of the
test. This is sometimes called systemic validity. It requires close consultation with the relevant educational and
employment bodies to help determine in detail what they intend the assessment instrument to achieve.
2.4 Global rating scales
Performance criteria are usually displayed in a rating scale. A global or holistic scale provides a general description of
ability, in which the various components of language use are grouped together in a single 'band' descriptor:
Band 6: Competent Speaker.
Is able to maintain the theme of dialogue, to follow topic switches and to use and appreciate
main attitude markers. Stumbles and hesitates at times but is reasonably fluent otherwise. Some errors and inappropriate
language but these will not impede exchange of views. Shows some independence in discussion with ability to initiate.
(Carroll cited in Weir 1993: 44)
Global descriptors are not always so brief as this. The Australian Second Language Proficiency Ratings (ASLPR) scale,
developed by Ingram and Wylie in 1982, uses an A4 page to present each band descriptor in considerable detail.
This allows for increased accuracy of identification, but at the cost of flexibility of assessment. Detailed global
scales effectively dictate what combination of skills is to be recognized at each level, although in practice the particular
features "may not co-occur in actual student performance" (Turner 1998: 200).
2.5 Analytic rating scales
The term analysis strictly refers to the breaking down of an object into its constituent parts or aspects. This is the
opposite of synthesis or the putting together of parts to make a whole. Although the general components of oral language
use are those discussed above in 2.2, there are various ways in which this "cake" of abilities can be sliced for assessment.
Following are examples of assessment categories from four different analytic rating scales:
FLUENCY, PRONUNCIATION, GRAMMAR, COMPREHENSIBILITY
- Speaking Proficiency English Assessment Kit (SPEAK), Educational Testing Service, USA (Clankie 1995: 124)
FLUENCY, ACCURACY, COMPREHENSION, COMMUNICATIVE ABILITY
- Placement rating scale, Nova conversation school, Japan (unpublished)
ATTITUDE & CONFIDENCE, EXPRESSIVENESS (pronunciation, intonation & volume),
BODY LANGUAGE, UNDERSTANDABILITY (for the listener, is the message delivered clearly?),
COMMUNICATIVE ABILITY (can the speaker say what he/she wants to say?)
- Negotiated performance profile, Tokyo Denki University, Japan (McClean 1995: 142-3)
FLUENCY, GRAMMATICAL ACCURACY, INTELLIGIBILITY, APPROPRIATENESS, ADEQUACY OF VOCABULARY FOR PURPOSE,
RELEVANCE AND ADEQUACY OF CONTENT
- Test in English for Educational Purposes (TEEP), Associated Examining Board, England (Weir 1993: 43-44)
Within each category, different levels of ability need to be distinguished clearly using descriptive language that can
be matched against test results. With clear criteria determined by the overall purpose of assessment (systemic validity)
and founded on a clear theory of language use (construct validity), it is possible to choose relevant assessment tasks.
The choice of relevant tasks is an important step in itself, for as shown in one study of interview-format discourse
(cited in Turner 1998: 195), "some of the supposed characteristics of intermediate versus advanced learners represented in
the rating scales were not substantiated in the actual performance of intermediate and advanced learners."
3. Choosing the best test format
Since Canale and Swain presented their model of communicative competence twenty years ago (see 2.1-2.2), the communicative
approach has spread into both teaching and testing methodology. According to Weir (1988: 82) communicative testing is
purposive, interesting, motivating, interactive, unpredictable and realistic.
Assessing interactive language means by definition that there is someone else actively taking part. The person being tested
is not only producing language, but is also responding in a communicative way with another interlocutor. This is quite
different from non-interactive stimulus response tasks. Techniques that use written or visual prompts to elicit language
samples are very straightforward and time-efficient to administer, and can also help to gauge the general educational level
of the student. The SPEAK test of oral proficiency is one example of a test composed mostly of non-interactive tasks
(Clankie 1995). Unfortunately such tasks display very few of the qualities of communicative testing listed above.
There are many kinds of oral assessment task that can be used - one writer listing over sixty variations (Underhill 1987).
In essence there are two general approaches that meet the criteria for interactive assessment. These are interview and role
play.
3.1 Interview tasks
Interview tasks are a direct test of language use; that is, "they measure oral skills by having the examinees actually
speak" (Turner 1998: 194). Even so, the ostensible context remains that of a language test. Beyond making the candidate
feel at ease, there is no attempt to simulate a non-test setting. Interview tasks thus represent a compromise solution to
the problem of how to control something that is inherently unpredictable.
3.1.1 Structured interviews
A structured interview composed of set questions has many advantages. It can be reliably used to determine someone's general
level in terms of grammatical knowledge, vocabulary, pronunciation and fluency. It can also be used to find out how well
the candidate can structure a short narrative, and to what degree they can express more complex points of view. It is
relatively cost and time efficient to administer, and if the interview is recorded properly then marking can be a fairly
reliable standardised procedure.
A common interview structure has four stages (Nagata 1995):
- a friendly warm up
- a level check to determine the candidate's overall ability in terms of the criteria
- challenging probes to find where performance drops
- a final wind down at a less challenging level
In EFL contexts, such as Japan, the structured interview is readily accepted by test users since it mirrors the often
formal social relationship that exists between teacher and student. This high face validity makes it a popular method of
oral assessment, despite its limitations as a measure of real-life oral ability.
The structured interview allows only a partial assessment of routine and improvisation skills (as defined in 2.2 above).
However keen the candidates may be, they remain passive respondents. Interactive routine and improvisational skills require
greater freedom on the part of the candidate to direct and initiate the conversational flow.
3.1.2 Unstructured interviews
Standardized assessment is not always necessary, especially in more informal settings. A less structured interview format
can more closely approximate the conditions of free conversation, and there is ideally a greater use of interactive skills
- including the strategic skills of negotiation of meaning and turn taking. Of course this will depend on there being
suitable motivation for conversation, and also a positive atmosphere for communication to happen.
Unfortunately an unstructured interview is not easily applied to large numbers of candidates. It requires experienced
interviewers who can facilitate the conversation in an unforced manner, allowing the testee to interact on an equal footing.
Regardless of the interviewer's skill, however, the unpredictable nature of the unstructured interview format means it
lacks reliability, and is unsuitable for large scale assessment.
3.2 Role play tasks
A role play is language use in a simulated real life situation. Unlike the interview format, role play can focus on a
variety of different language functions. This is especially useful for the assessment of specific work-related oral
performance. It is a better indicator of real life performance than the interview format, although it tends to favour
extroverted candidates with a degree of acting ability (Weir 1988: 88). The assessor can be involved as a participant in
the role play, or simply as an observer of two or more testees.
3.2.1 Structured role plays (information gap)
The structured or controlled role play gives the candidates a detailed set of instructions to follow, usually with some
kind of form to complete as they go. These are usually called information gap activities since they involve the transfer
of information with others to complete a set task. This is a popular method of language practice in EFL classrooms, and
there are many published resources available to teachers that can be photocopied for immediate use.
The major drawback of information gap activities is that they are often no more than mechanical exercises requiring the
production of linguistic forms on cue. There is very little scope for the purposeful creative use of language, which makes
it difficult for students to identify with the role. Tightly scripted information gap tasks have little predictive validity,
since real interactive language use is much more unpredictable.
3.2.2 Unstructured role plays
Unstructured role play allows the participants to select and structure language more freely. Instructions on small role play
cards can provide more or less detail, depending on the ability of the testees to improvise and initiate. Instructions need
to be made clear to the testees before the role play begins so that it is not "a test of comprehension of instructions"
(Underhill 1987: 52). Participants should also be able to identify with the role and understand what communication in the
role play is meant to achieve.
Role plays can be designed to test language use in various settings, such as at a hotel, a doctor's office, a supermarket,
or a boardroom. The role play may focus on general language functions (or purposes), such as asking, checking, describing,
complaining, apologizing, or giving advice, to take only a few examples. Unlike information gap activities, role play
instructions do not usually specify particular language structures to be used, though they may be implied in the way the
instructions are written.
The lack of obvious manipulation of the testees' responses is the main strength of this format. Well-designed role plays
are purposive, interesting, motivating, interactive, unpredictable and realistic, to use the characteristics of communicative
language given by Weir (1988: 82). This means that there is more scope for higher level testees to display a range of
interactional and improvisational skills.
The advantages of this approach mean increased validity as a test of real life oral skills, but at the cost of reliability
of measurement due to the unpredictability of testees' responses. To some extent this can be balanced out by ensuring there
are well defined procedures of assessment based on clear criteria. Testees themselves also need to understand the criteria
under which their own language performances will be judged.
4. Special issues
Some special issues that influence the design and implementation of assessment also need to be mentioned.
4.1 Practicality
The practicality of a test refers to the degree to which it is cost effective and easy to administer. The number of testees,
the time constraints for testing and marking, and the available human and physical resources all need to be considered carefully
before an assessment scheme is chosen. This is not only an issue of money, but also of the perceptions of those who will be
taking and using the test. Also, if a test can be administered efficiently by assessors and markers, this increases the
validity and reliability of the results as a whole.
4.2 Bias for best
Testing language skills requires getting a representative sample of optimum performance. To 'bias for best' means to elicit
a candidate's best performance on a test. A poorly designed or delivered test will not provide consistent results. This
may be because confusing instructions favour some students over others, or perhaps because role play situations require
specific knowledge or vocabulary that only some of the candidates possess. Also, generally distracting or stressful
conditions of assessment will clearly disadvantage some students over others in a way that is unrelated to language ability.
4.3 Marking
Applying descriptive assessment criteria to a candidate's oral performance requires making subjective (or impressionistic)
judgements. This is in contrast to objective marking, in which a quantitative marking scheme is mechanically applied to
structured tasks, such as multiple choice and sentence completion exercises.
A descriptive scale of oral performance, with clearly defined levels, can be combined with quantitative grades. Subjective
judgements matching performance to such descriptors will then generate a quantitative grade score useful for ranking
candidates. Analytic rating scales, which describe specific language skills (see 2.5 above), can be graded differently to
emphasize the relative importance of different skills. This is called 'weighting' the assessment criteria, and needs to
be based on a clear understanding of the stages of language development (construct validity) and the purpose of the
assessment instrument (systemic validity). A graded analytic scale can then be combined with a global scale, for example
as shown by McClean (1995) in her description of a negotiated grading scheme at a Japanese university.
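The arithmetic of weighting can be sketched briefly. The example below borrows the four category names of the SPEAK scale quoted in 2.5, but the 0-3 rating range, the weights and the function name weighted_score are assumptions chosen purely for illustration; they are not the scheme used by SPEAK or by McClean.

```python
# Hypothetical weights: fluency counts twice as much as the other categories.
WEIGHTS = {"fluency": 2, "pronunciation": 1, "grammar": 1, "comprehensibility": 1}


def weighted_score(ratings, weights=WEIGHTS):
    """Combine per-category ratings into a single weighted average."""
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the weighted categories")
    total_weight = sum(weights.values())
    return sum(ratings[c] * weights[c] for c in weights) / total_weight


# Invented ratings on an assumed 0-3 scale for one candidate.
ratings = {"fluency": 3, "pronunciation": 2, "grammar": 1, "comprehensibility": 3}

unweighted = sum(ratings.values()) / len(ratings)
print(unweighted)               # 2.25
print(weighted_score(ratings))  # 2.4
```

The same set of subjective ratings yields a higher overall grade here once fluency is emphasized, which is why the choice of weights needs to be justified by the purpose of the test rather than set arbitrarily.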
Grading is very much dependent on the purpose of the test and the way this is reflected in the criteria. An achievement
test that is criterion referenced will judge candidates individually on their achievement of learning outcomes. Score
distribution depends solely on learning success, and it is theoretically possible for all candidates to receive 100%.
On the other hand, a test for selection purposes will need to separate candidates, making fine distinctions between their
performances. This kind of comparative assessment is called norm referenced, and the scores are ideally distributed on a
bell-shaped curve, so that most candidates are placed at the centre of the distribution.
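The contrast between the two kinds of referencing can be made concrete with a small sketch. The scores, the pass mark of 70 and the function names below are invented for illustration. Criterion referencing compares each candidate with a fixed standard, whereas norm referencing expresses each score relative to the group, here as a standard (z) score.

```python
from statistics import mean, pstdev

# Invented percentage scores for four candidates.
scores = {"A": 82, "B": 74, "C": 90, "D": 74}
PASS_MARK = 70  # hypothetical criterion


def criterion_referenced(scores, pass_mark=PASS_MARK):
    """Judge each candidate against the fixed criterion, ignoring the others."""
    return {name: score >= pass_mark for name, score in scores.items()}


def norm_referenced(scores):
    """Express each score as a z-score relative to the whole group."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Identical scores cannot be separated by comparison.
        return {name: 0.0 for name in scores}
    return {name: (score - mu) / sigma for name, score in scores.items()}


print(criterion_referenced(scores))  # every candidate meets the criterion
print(norm_referenced(scores))       # candidates are spread around the group mean
```

With these figures all four candidates satisfy the criterion, yet the norm-referenced view still ranks them, placing C above the group mean and B and D below it.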
Conclusion
An effective test of interactive oral skills is not a haphazard selection of tasks. Instead each assessment
situation presents a set of practical demands that need to be specifically addressed. The principles of validity,
reliability, practicality and bias for best provide basic guidelines for evaluating the effectiveness of a test instrument.
A theoretical model of oral skills is also necessary to structure what is fundamentally fleeting and changeable. At the same
time it needs to be remembered that human skills are highly dependent on a variety of internal and external factors that are
independent of language ability per se. The art of testing involves minimising the influence of such extraneous factors
and creating conditions under which all candidates can display their genuine abilities.
Bibliography
Alderson, J. C., K. J. Krahnke & C. W. Stansfield eds. (1987) Reviews of English Language Proficiency Tests. Washington:
Teachers of English to Speakers of Other Languages.
Brown, J. D. & S. O. Yamashita eds. (1995) JALT Applied Materials: Language Testing in Japan. Tokyo: The Japan Association
for Language Teaching.
Canale, M. & M. Swain (1980) "Theoretical bases of communicative approaches to second language teaching and testing."
Applied Linguistics, 1(1), pp. 1-47.
Clankie, S. (1995) "The SPEAK test of oral proficiency: A case study of incoming freshmen." In Brown & Yamashita eds.
pp. 119-125.
Hughes, A. (1989) Testing for Language Teachers. Cambridge: Cambridge University Press.
McClean, J. (1995) "Negotiating a spoken-English scheme with Japanese university students." In Brown & Yamashita eds. pp.
136-148.
Nagata, H. (1995) "Testing oral ability: ILR and ACTFL oral proficiency interviews." In Brown & Yamashita eds. pp. 108-118.
Nakamura, Y. (1995) "Making speaking tests valid: Practical considerations in a classroom setting." In Brown & Yamashita eds.
pp. 126-133.
Turner, J. (1998) "Assessing speaking." Annual Review of Applied Linguistics, Vol 18, pp. 192-207, Cambridge: Cambridge
University Press.
Underhill, N. (1987) Testing Spoken Language: A Handbook of Oral Testing Techniques. Cambridge: Cambridge University Press.
Weir, C. J. (1988) Communicative Language Testing with Special Reference to English as a Foreign Language. Exeter:
University of Exeter.
Weir, C. J. (1993) Understanding and Developing Language Tests. New York: Prentice Hall.