Assessing interactive oral skills in EFL contexts

by Jason Beale, MEd (TESOL)

Other than for study or review, no part of this essay may be reproduced in any form without prior permission from the author.



Introduction

1. The purpose of assessment

   1.1  Testing general proficiency
   1.2  Educational placement and diagnosis
   1.3  Formative and summative assessment
   1.4  Testing for special purposes

2. Establishing assessment criteria

   2.1  The importance of validity
   2.2  The components of language use
   2.3  Specifying performance criteria
   2.4  Global rating scales
   2.5  Analytic rating scales

3. Choosing the best test format

   3.1  Interview tasks
      3.1.1  Structured interviews
      3.1.2  Unstructured interviews

   3.2  Role play tasks
      3.2.1  Structured role plays (information gap)
      3.2.2  Unstructured role plays

4. Special issues

   4.1  Practicality
   4.2  Bias for best
   4.3  Marking

Conclusion

Bibliography




Introduction

There are many English language teachers working in EFL contexts overseas. Their work often requires the quick assessment of a student's oral ability, usually during a brief initial interview or even in the very first class. This can help determine the choice of class material and the overall aims of a course of instruction. Informal assessment also continues throughout any teaching program, as a way of ensuring that desired outcomes are being achieved and students' needs are being met.

Such informal assessment is clearly a central part of language teaching. It is no less important than the formal testing of achievement, or the testing of employment and academic-related proficiency. It follows that all teachers in EFL contexts, whatever their positions and duties, ought to have a basic understanding of the principles underlying assessment of oral language skills.

1. The purpose of assessment

Before designing oral assessment tasks there needs to be a clear idea of the purpose of assessment. This is essential because the same degree of detail is not required in every testing situation. The purpose of the test will determine the overall shape of the assessment criteria to be used.

1.1 Testing general proficiency

The assessment of general proficiency is independent of a particular syllabus, and provides a broad view of a person's language ability. Ideally it focuses on fundamental oral skills, as well as on common communicative functions. Tasks such as summarising technical data, or describing statistics, require a grasp of fundamental skills of course, but they are clearly limited to academic or employment settings. They are formal tasks requiring particular language and presentation skills.

The term 'proficiency' refers to the practical use of language as a whole. It is therefore best assessed directly by eliciting extended samples of interactive language use in realistic contexts. The indirect assessment of oral language, through controlled response to single test items, has limited value as an indicator of real-life oral proficiency.

Unfortunately there is no such thing as a definitive test of general oral ability that can be applied in any situation. The standard of 'native-like' proficiency is only a convenient abstraction - one that ignores the personal and cultural differences that make communication real and complex. In EFL contexts, such as Japan, testees are often quite unfamiliar with Western cultural references and modes of behaviour, and so the design of test items needs to be as culturally neutral as possible without being too vague.

1.2 Educational placement and diagnosis

Assessment for educational placement is used to place the student in a suitable level for learning. It does not require the same degree of detail as a general proficiency test, but involves matching a student against fairly broad criteria in a band scale. Each 'band' describes the minimum level of ability needed for each stage of instruction. The most basic band scale would consist of only three levels: beginner, intermediate, and advanced.
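As a rough illustration of how a three-level band scale might be applied, the following sketch places a student by matching a single interview score against the minimum required for each band. The 0-100 scale, the cut-off values and the names `Band`, `BANDS` and `place` are assumptions made for the example only; they are not taken from any published scale.

```python
from typing import NamedTuple


class Band(NamedTuple):
    name: str
    min_score: int  # hypothetical minimum score needed to enter this band


# Hypothetical cut-offs on a 0-100 scale, ordered from highest to lowest.
BANDS = [
    Band("advanced", 70),
    Band("intermediate", 40),
    Band("beginner", 0),
]


def place(score: int) -> str:
    """Return the name of the highest band whose minimum the score meets."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for band in BANDS:
        if score >= band.min_score:
            return band.name
    # Not reachable while the lowest band has a minimum of 0.
    raise AssertionError("no band matched")


print(place(55))  # intermediate
```

In practice a placement decision rests on a rater's judgement against band descriptors rather than on a single number, so the score here simply stands in for that judgement.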

Diagnostic assessment provides more detailed information on a learner's strengths and weaknesses. It requires descriptive analysis, both by impressionistic description and by rating specific aspects of language use. Such information is valuable for tailoring lessons more closely to learners' needs, and as a standard for evaluating progress at a later stage.

1.3 Formative and summative assessment

Formative assessment indicates a learner's ongoing progress during a course. It need not involve testing under formal conditions, but may simply consist of various impressions and notes that the teacher takes while observing students engage in communicative tasks. Summative assessment on the other hand is the formal measurement of a learner's achievement at the end of a unit or course of instruction. This involves matching student achievement with the stated objectives of the course. Summative assessment is different from general proficiency testing in that the assessment tasks of the former are based on the representative sampling of a syllabus.

1.4 Testing for special purposes

Outside of a specific syllabus of instruction, many people sit self-contained language tests that are recognized by higher education and employment bodies (e.g. TOEIC and TOEFL). The design of assessment tasks and choice of language is intended to reflect the skills and knowledge needed in special contexts of work or study.

These kinds of tests are used as high-grade filters that discriminate between learners and rank them against a sliding scale. The main purpose of assessment here is to identify candidates for access to limited opportunities such as scholarships and promotions. As such they are not particularly suitable for assessing an individual's particular level of proficiency in detail.

2. Establishing assessment criteria

According to the Australian Oxford Mini Dictionary a criterion is a "principle or standard by which (a) thing is judged." To test oral language skills there need to be such criteria to act as guidelines for judgement. These should describe the various levels of performance in a way that can be tested both logically and consistently. The last two points are often called 'validity' and 'reliability' in the literature on assessment.

2.1 The importance of validity

Validity has been described as "the single most critical element in constructing foreign language tests" (Nakamura 1995: 126). A valid test has a recognizable logic to it that makes the test a meaningful tool of assessment. The most fundamental kind of validity relates to the underlying theory of language on which the test is constructed (construct validity). This influences the sampling of language material and tasks (content validity), which in turn has an effect on the appearance of the test to the teachers and learners who use it (face validity).

Construct validity requires a set of principles that can adequately describe real-life language use. In the case of oral language skills this is not such a simple matter. Speaking may seem to be a general-purpose ability, but it occurs under many contexts and conditions, and for many reasons. Each has its own characteristics and demands, especially when seen as an interactive skill. In the last few decades a great deal of effort has been made to describe language use as an interactive or communicative system. Canale and Swain's (1980) model of 'communicative competence' is certainly the best known example in the literature on applied linguistics.

2.2 The components of language use

Grammar, vocabulary, and pronunciation all fall under the general category of grammatical (or linguistic) competence in Canale and Swain's influential model. These are the basic skills, traditionally taught and tested in isolation from a communicative context. Yet in order to predict real language use successfully, higher level skills and knowledge also need to be considered.

A second category called discourse competence concerns the way language is conventionally shaped in different communicative contexts. Describing a suspect during a police interview, for example, requires more than basic grammatical skills - it involves selecting, organising and linking elements together to create a structured and coherent whole. Canale and Swain distinguish a third category called sociocultural competence, which covers the cultural forms of speech deemed appropriate in a particular community.

Weir (1993), drawing from Bygate, conveniently includes both discourse and sociocultural aspects of language use under the single heading "routine skills". These are "frequently recurring ways of structuring speech, such as descriptions, comparisons, instructions, telling stories", and include the patterns of interactional language use seen in such things as "buying goods in a shop, or telephone conversations, interviews, meetings, discussions, decision making, etc" (Weir 1993: 32).

Canale and Swain's fourth category is strategic competence, which covers the various techniques people use to manage and enhance communication. This category is covered by Weir under the heading "improvisation skills" (1993: 32-4). Communication is a faulty and chaotic process and speakers need to be able to improvise when their conventional language routines fail. This includes both the "negotiation of meaning" in various ways to enhance understanding, as well as the "management of interaction" to establish "who is going to speak next and what the topic is going to be" (turn taking and topic initiation).

2.3 Specifying performance criteria

As the preceding section has shown, interactive oral skills involve different categories of practical knowledge (or know-how), each one effectively building on the last. First, basic grammatical & linguistic knowledge (core skills), then discourse & sociocultural knowledge (routine skills), and finally strategic knowledge (improvisational skills). Having this general framework is helpful in identifying the various components of oral ability that can be assessed. Yet deciding what weighting to give each category of skill is still not a straightforward matter.

Improvisational skills are useful in every general context. For example, "Excuse me, what did you say?", or its equivalent, is an essential phrase. In particular contexts, such as business negotiation, there is a greater need for highly developed improvisational skills. In choosing or designing specific performance criteria for an oral test it is important to decide which of these categories are important and to what extent at each level of a candidate's ability. Different criteria will produce different results. As noted by Brown, "if each group were to develop its own assessment framework..., they may, in fact, through the inclusion or weighting of specific criteria, produce schemes which lead to quite different evaluations of candidates ability." (cited in Turner 1998: 198) The assessment criteria need to be related to the actual purpose of the test. This is sometimes called systemic validity. It requires close consultation with the relevant educational and employment bodies to help determine in detail what they intend the assessment instrument to achieve.

2.4 Global rating scales

Performance criteria are usually displayed in a rating scale. A global or holistic scale provides a general description of ability, in which the various components of language use are grouped together in a single 'band' descriptor:

Band 6: Competent Speaker.
Is able to maintain the theme of dialogue, to follow topic switches and to use and appreciate main attitude markers. Stumbles and hesitates at times but is reasonably fluent otherwise. Some errors and inappropriate language but these will not impede exchange of views. Shows some independence in discussion with ability to initiate. (Carroll cited in Weir 1993: 44)

Global descriptors are not always so brief as this. The Australian Second Language Proficiency Ratings (ASLPR) scale, developed by Ingram and Wylie in 1982, uses an A4 page to present each band descriptor in considerable detail. This allows for increased accuracy of identification, but at the cost of flexibility of assessment. Detailed global scales effectively dictate what combination of skills is to be recognized at each level, although in practice the particular features "may not co-occur in actual student performance" (Turner 1998: 200).

2.5 Analytic rating scales

The term analysis strictly refers to the breaking down of an object into its constituent parts or aspects. This is the opposite of synthesis or the putting together of parts to make a whole. Although the general components of oral language use are those discussed above in 2.2, there are various ways in which this "cake" of abilities can be sliced for assessment. Following are examples of assessment categories from four different analytic rating scales:

FLUENCY, PRONUNCIATION, GRAMMAR, COMPREHENSIBILITY
- Speaking Proficiency English Assessment Kit (SPEAK), Educational Testing Service, USA (Clankie 1995: 124)

FLUENCY, ACCURACY, COMPREHENSION, COMMUNICATIVE ABILITY
- Placement rating scale, Nova conversation school, Japan (unpublished)

ATTITUDE & CONFIDENCE, EXPRESSIVENESS (pronunciation, intonation & volume), BODY LANGUAGE, UNDERSTANDABILITY (for the listener, is the message delivered clearly?), COMMUNICATIVE ABILITY (can the speaker say what he/she wants to say?)
- Negotiated performance profile, Tokyo Denki University, Japan (McClean 1995: 142-3)

FLUENCY, GRAMMATICAL ACCURACY, INTELLIGIBILITY, APPROPRIATENESS, ADEQUACY OF VOCABULARY FOR PURPOSE, RELEVANCE AND ADEQUACY OF CONTENT
- Test in English for Educational Purposes (TEEP), Associated Examining Board, England (Weir 1993: 43-44)

Within each category, different levels of ability need to be distinguished clearly using descriptive language that can be matched against test results. With clear criteria determined by the overall purpose of assessment (systemic validity) and founded on a clear theory of language use (construct validity), it is possible to choose relevant assessment tasks. The choice of relevant tasks is an important step in itself, for as shown in one study of interview-format discourse (cited in Turner 1998: 195), "some of the supposed characteristics of intermediate versus advanced learners represented in the rating scales were not substantiated in the actual performance of intermediate and advanced learners."

3. Choosing the best test format

Since Canale and Swain presented their model of communicative competence twenty years ago (see 2.1-2.2), the communicative approach has spread into both teaching and testing methodology. According to Weir (1988: 82) communicative testing is purposive, interesting, motivating, interactive, unpredictable and realistic.

Assessing interactive language means by definition that there is someone else actively taking part. The person being tested is not only producing language, but is also responding in a communicative way to another interlocutor. This is quite different from non-interactive stimulus response tasks. Techniques that use written or visual prompts to elicit language samples are very straightforward and time-efficient to administer, and can also help to gauge the general educational level of the student. The SPEAK test of oral proficiency is one example of a test composed mostly of non-interactive tasks (Clankie 1995). Unfortunately such tasks fulfil very few of the qualities of communicative testing listed above.

There are many kinds of oral assessment task that can be used - one writer listing over sixty variations (Underhill 1987). In essence there are two general approaches that meet the criteria for interactive assessment. These are interview and role play.

3.1 Interview tasks

Interview tasks are a direct test of language use; that is, "they measure oral skills by having the examinees actually speak" (Turner 1998: 194). Even so, the ostensible context remains that of a language test. Beyond making the candidate feel at ease, there is no attempt to simulate a non-test setting. Interview tasks thus represent a compromise solution to the problem of how to control something that is inherently unpredictable.

3.1.1 Structured interviews

A structured interview composed of set questions has many advantages. It can be reliably used to determine someone's general level in terms of grammatical knowledge, vocabulary, pronunciation and fluency. It can also be used to find out how well the candidate can structure a short narrative, and to what degree they can express more complex points of view. It is relatively cost and time efficient to administer, and if the interview is recorded properly then marking can be a fairly reliable standardised procedure.

A common interview structure has four stages (Nagata 1995):

  1. a friendly warm up
  2. a level check to determine the candidate's overall ability in terms of the criteria
  3. challenging probes to find where performance drops
  4. a final wind down at a less challenging level

In EFL contexts, such as Japan, the structured interview is readily accepted by test users since it mirrors the often formal social relationship that exists between teacher and student. This high face validity makes it a popular method of oral assessment, despite its limitations as a measure of real-life oral ability.

The structured interview allows only a partial assessment of routine and improvisation skills (as defined in 2.2 above). However keen the candidates may be, they remain passive respondents. Interactive routine and improvisational skills require greater freedom on the part of the candidate to direct and initiate the conversational flow.

3.1.2 Unstructured interviews

Standardized assessment is not always necessary, especially in more informal settings. A less structured interview format can more closely approximate the conditions of free conversation, and there is ideally a greater use of interactive skills - including the strategic skills of negotiation of meaning and turn taking. Of course this will depend on there being suitable motivation for conversation, and also a positive atmosphere for communication to happen.

Unfortunately an unstructured interview is not easily applied to large numbers of candidates. It requires experienced interviewers who can facilitate the conversation in an unforced manner, allowing the testee to interact on an equal footing. Regardless of the interviewer's skill, however, the unpredictable nature of the unstructured interview format means it lacks reliability, and is unsuitable for large scale assessment.

3.2 Role play tasks

A role play is language use in a simulated real life situation. Unlike the interview format, role play can focus on a variety of different language functions. This is especially useful for the assessment of specific work-related oral performance. It is a better indicator of real life performance than the interview format, although it tends to favour extroverted candidates with a degree of acting ability (Weir 1988: 88). The assessor can be involved as a participant in the role play, or simply as an observer of two or more testees.

3.2.1 Structured role plays (information gap)

The structured or controlled role play gives the candidates a detailed set of instructions to follow, usually with some kind of form to complete as they go. These are usually called information gap activities since they involve the transfer of information with others to complete a set task. This is a popular method of language practice in EFL classrooms, and there are many published resources available to teachers that can be photocopied for immediate use.

The major drawback of information gap activities is that they are often no more than mechanical exercises requiring the production of linguistic forms on cue. There is very little scope for the purposeful creative use of language, which makes it difficult for students to identify with the role. Tightly scripted information gap tasks have little predictive validity, since real interactive language use is much more unpredictable.

3.2.2 Unstructured role plays

Unstructured role play allows the participants to select and structure language more freely. Instructions on small role play cards can provide more or less detail, depending on the ability of the testees to improvise and initiate. Instructions need to be made clear to the testees before the role play begins so that it is not "a test of comprehension of instructions" (Underhill 1987: 52). Participants should also be able to identify with the role and understand what communication in the role play is meant to achieve.

Role plays can be designed to test language use in various settings, such as at a hotel, a doctor's office, a supermarket, or a boardroom. The role play may focus on general language functions (or purposes), such as asking, checking, describing, complaining, apologizing, or giving advice, to take only a few examples. Unlike information gap activities, role play instructions do not usually specify particular language structures to be used, though they may be implied in the way the instructions are written.

The lack of obvious manipulation of the testees' responses is the main strength of this format. Well-designed role plays are purposive, interesting, motivating, interactive, unpredictable and realistic, to use the characteristics of communicative language given by Weir (1988: 82). This means that there is more scope for higher level testees to display a range of interactional and improvisational skills.

The advantages of this approach mean increased validity as a test of real life oral skills, but at the cost of reliability of measurement due to the unpredictability of testees' responses. To some extent this can be balanced out by ensuring there are well defined procedures of assessment based on clear criteria. Testees themselves also need to understand the criteria under which their own language performances will be judged.

4. Special issues

Some special issues that influence the design and implementation of assessment also need to be mentioned.

4.1 Practicality

The practicality of a test refers to the degree to which it is cost effective and easy to administer. The number of testees, the time constraints for testing and marking, and the available human and physical resources all need to be considered carefully before an assessment scheme is chosen. This is not only an issue of money, but also of the perceptions of those who will be taking and using the test. Also, if a test can be administered efficiently by assessors and markers, this increases the validity and reliability of the results as a whole.

4.2 Bias for best

Testing language skills requires getting a representative sample of optimum performance. To 'bias for best' means to elicit a candidate's best performance on a test. A poorly designed or delivered test will not provide consistent results. This may be because confusing instructions favour some students over others, or perhaps because role play situations require specific knowledge or vocabulary that only some of the candidates possess. Also, generally distracting or stressful conditions of assessment will clearly disadvantage some students over others in a way that is unrelated to language ability.

4.3 Marking

Applying descriptive assessment criteria to a candidate's oral performance requires making subjective (or impressionistic) judgements. This is in contrast to objective marking, in which a quantitative marking scheme is mechanically applied to structured tasks, such as multiple choice and sentence completion exercises.

A descriptive scale of oral performance, with clearly defined levels, can be combined with quantitative grades. Subjective judgements matching performance to such descriptors will then generate a quantitative grade score useful for ranking candidates. Analytic rating scales, which describe specific language skills (see 2.5 above), can be graded differently to emphasize the relative importance of different skills. This is called 'weighting' the assessment criteria, and needs to be based on a clear understanding of the stages of language development (construct validity) and the purpose of the assessment instrument (systemic validity). A graded analytic scale can then be combined with a global scale, for example as shown by McClean (1995) in her description of a negotiated grading scheme at a Japanese university.
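The arithmetic behind weighting can be sketched briefly. In the example below a rater's band judgements on four analytic categories are combined into a single percentage, with fluency counting twice as much as each of the other categories. The category names echo one of the scales listed in 2.5, but the weights, the 0-5 band range and the names `WEIGHTS`, `MAX_BAND` and `weighted_score` are assumptions chosen purely for illustration.

```python
from typing import Mapping

# Hypothetical weights for four analytic categories; they sum to 1.0.
WEIGHTS = {
    "fluency": 0.4,
    "accuracy": 0.2,
    "comprehension": 0.2,
    "communicative_ability": 0.2,
}
MAX_BAND = 5  # assumed top band on each analytic scale (bands run 0-5)


def weighted_score(ratings: Mapping[str, int]) -> float:
    """Combine per-category band ratings into a weighted percentage."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the weighted categories")
    for category, band in ratings.items():
        if not 0 <= band <= MAX_BAND:
            raise ValueError(f"{category}: band must be between 0 and {MAX_BAND}")
    total = sum(WEIGHTS[category] * ratings[category] for category in WEIGHTS)
    return round(100 * total / MAX_BAND, 1)


ratings = {
    "fluency": 4,
    "accuracy": 3,
    "comprehension": 5,
    "communicative_ability": 4,
}
print(weighted_score(ratings))  # 80.0
```

Changing the weights changes the ranking of candidates with uneven profiles, which is why the weighting needs to follow from the purpose of the test rather than from convenience.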

Grading is very much dependent on the purpose of the test and the way this is reflected in the criteria. An achievement test that is criterion referenced will judge candidates individually on their achievement of learning outcomes. Score distribution depends solely on learning success, and it is theoretically possible for all candidates to receive 100%. On the other hand, a test for selection purposes will need to separate candidates, making fine distinctions between their performances. This kind of comparative assessment is called norm referenced, and the scores are ideally distributed on a bell-shaped curve, so that most candidates are placed at the centre of the distribution.
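The contrast between the two kinds of referencing can be made concrete with a small sketch. Given the same set of scores, the criterion-referenced function judges each candidate against a fixed pass mark, while the norm-referenced function expresses each score relative to the group as a standard score (z-score). The pass mark of 60, the sample scores and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Mapping


def criterion_referenced(scores: Mapping[str, float], pass_mark: float = 60) -> Dict[str, bool]:
    """Judge each candidate individually against a fixed (assumed) pass mark."""
    return {name: score >= pass_mark for name, score in scores.items()}


def norm_referenced(scores: Mapping[str, float]) -> Dict[str, float]:
    """Express each score as a z-score, i.e. its distance from the group
    mean measured in standard deviations of the group."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # Every candidate scored the same, so no one can be separated.
        return {name: 0.0 for name in scores}
    return {name: (score - mu) / sigma for name, score in scores.items()}


scores = {"A": 90, "B": 80, "C": 70}  # hypothetical candidates
print(criterion_referenced(scores))   # every candidate meets the pass mark
print(norm_referenced(scores))        # A above the mean, B at it, C below
```

All three candidates succeed under the criterion-referenced reading, yet the norm-referenced reading still separates them, which is the distinction a selection test relies on.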

Conclusion

An effective test of interactive oral skills is not a haphazard selection of tasks. Instead each assessment situation presents a set of practical demands that need to be specifically addressed. The principles of validity, reliability, practicality and bias for best provide basic guidelines for evaluating the effectiveness of a test instrument.

A theoretical model of oral skills is also necessary to structure what is fundamentally fleeting and changeable. At the same time it needs to be remembered that human skills are highly dependent on a variety of internal and external factors that are independent of language ability per se. The art of testing involves minimising the influence of such extraneous factors and creating conditions under which all candidates can display their genuine abilities.

Bibliography

Alderson, J. C., K. J. Krahnke & C. W. Stansfield eds. (1987) Reviews of English Language Proficiency Tests. Washington: Teachers of English to Speakers of Other Languages.

Brown, J. D. & S. O. Yamashita eds. (1995) JALT Applied Materials: Language Testing in Japan. Tokyo: The Japan Association for Language Teaching.

Canale, M. & M. Swain (1980) "Theoretical bases of communicative approaches to second language teaching and testing." Applied Linguistics, 1(1), pp. 1-47.

Clankie, S. (1995) "The SPEAK test of oral proficiency: A case study of incoming freshmen." In Brown & Yamashita eds. pp. 119-125.

Hughes, A. (1989) Testing for Language Teachers. Cambridge: Cambridge University Press.

McClean, J. (1995) "Negotiating a spoken-English scheme with Japanese university students." In Brown & Yamashita eds. pp. 136-148.

Nagata, H. (1995) "Testing oral ability: ILR and ACTFL oral proficiency interviews." In Brown & Yamashita eds. pp. 108-118.

Nakamura, Y. (1995) "Making speaking tests valid: Practical considerations in a classroom setting." In Brown & Yamashita eds. pp. 126-133.

Turner, J. (1998) "Assessing speaking." Annual Review of Applied Linguistics, Vol 18, pp. 192-207, Cambridge: Cambridge University Press.

Underhill, N. (1987) Testing Spoken Language: A Handbook of Oral Testing Techniques. Cambridge: Cambridge University Press.

Weir, C. J. (1988) Communicative Language Testing with Special Reference to English as a Foreign Language. Exeter: University of Exeter.

Weir, C. J. (1993) Understanding and Developing Language Tests. New York: Prentice Hall.

