Assessing the speaking proficiency of L2 Chinese learners: Review of the Hanyu Shuiping Kouyu Kaoshi

I have seen a couple of international students that achieved good scores on the HSK level 5— the advanced-level Chinese proficiency test, and yet [they] can barely communicate at all in Chinese, not even daily conversation like “how was your weekend?” (A professor who teaches Chinese at a Confucius Institute in the USA, Interview, February 26, 2022)


Introduction
The Hanyu Shuiping Kaoshi (HSK) is a multi-level, multi-purpose Chinese proficiency test developed by the Center for Language Education and Cooperation (previously the Office of Chinese Language Council International and henceforth referred to by its colloquial name, "Hanban"). It assesses the reading, writing, and listening skills of second language (L2) Chinese learners who want to study or work in China (Peng et al., 2020). Because the test contains no speaking section, however, educators and employers in China have complained that HSK results offer no way to evaluate test takers' speaking ability (Jin, 2019). To address this concern, Hanban introduced a speaking test, the Hanyu Shuiping Kouyu Kaoshi (HSKK), to measure L2 Chinese learners' general speaking proficiency as a complement to the HSK. Because the original one-level (i.e., advanced-level) HSKK (used from 1990 to 2009) was criticized for being impractical and difficult (Wang & Jiang, 2017), a new three-level version was introduced in 2010; this current version has gradually attracted the attention of international Chinese learners. The HSKK is presently conducted in about 120 countries and 860 cities worldwide, and more than 30,000 L2 Chinese learners take the test yearly (Ding et al., 2021). Because the HSKK is the most influential speaking test for L2 Chinese learners, its results are of great importance for test users (Cui, 2010; Wang, 2014, 2018; Yang, 2017). The current version of the HSKK (Hanban, 2010) is thus reviewed in this paper.

Test purpose
HSKK aims to promote spoken Chinese teaching and learning for L2 learners both domestically and internationally while offering an effective evaluation of test takers' Chinese speaking proficiency to help various entities make score-based decisions (Hanban, 2010). These entities include: (1) Chinese higher education institutes (for making decisions regarding international students about admission, level of Chinese speaking ability for class assignment, and giving and waiving credits); (2) employers (for making decisions about recruitment, training, and promotion of highly skilled international workers); (3) L2 Chinese learners seeking to understand and enhance their Chinese speaking skills; and (4) Chinese language teaching institutes seeking to assess their teaching and training outcomes.

Levels
Hanban (2010) provides a basic description of the three difficulty levels of the HSKK (primary, intermediate, and advanced), which potential test takers can refer to when deciding which level to take. Specifically, primary-level test takers are those who have 6 months of Chinese learning experience (at a pace of 2-3 class hours per week or equivalent) and have mastered about 200 commonly used words. Intermediate-level test takers have 1-2 years of Chinese learning experience and have mastered about 900 of the most common Chinese words. Advanced-level test takers have learned Chinese for more than 2 years and have mastered about 3,000 commonly used words.

Length and administration
The test duration of the HSKK ranges from 17 minutes (primary level) to 24 minutes (advanced level). The HSKK can be taken in two formats, paper-based and computer-based, depending on test takers' preference and the availability of test centers. Both formats require test takers to audio record their spoken answers; test takers in the paper-based HSKK, however, are provided with audio recording equipment. In response to the global COVID-19 pandemic, some test centers began offering the computer-based HSKK remotely (Hanban, 2020). Thirty working days after the test, test takers can check their results at www.chinesetest.cn using the test information given on the test ticket.
Street, Xicheng District, Beijing, China), an independent legal entity owned by the Center for Language Education and Cooperation, a non-governmental public institution affiliated with the Ministry of Education (MoE) of China.

Price
HSKK fees can vary because of exchange rates. For those taking the HSKK in Beijing or Shanghai, fees vary by level, ranging from ¥200 (around US$30) for the primary-level HSKK to ¥400 (around US$60) for the advanced-level HSKK. Fees for the remote test are 30% higher. This surcharge is mainly due to the extra work carried out by the test center and invigilators. Specifically, remote invigilators are required to check whether test takers' physical environment meets the test requirements, that is, whether the test takers' cameras capture body movements, and to verify the test takers' ID documents before the test begins (Hanban, 2020).
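The surcharge arithmetic can be illustrated with a brief sketch. It assumes that the 30% surcharge applies uniformly to the Beijing/Shanghai base fees quoted above; the function name `remote_fee` and the whole-yuan integer arithmetic are illustrative assumptions, not part of any published Hanban fee schedule.

```python
# Illustrative only: base fees are the Beijing/Shanghai figures quoted in the text.
# The intermediate-level fee is not stated in the text, so it is omitted here.
BASE_FEES_CNY = {"primary": 200, "advanced": 400}

# Assumption: the 30% remote surcharge applies uniformly to every level.
REMOTE_SURCHARGE_PERCENT = 30


def remote_fee(base_fee_cny: int) -> int:
    """Return the remote-test fee in whole yuan for a given base fee."""
    # Integer arithmetic avoids floating-point rounding on currency amounts.
    return base_fee_cny * (100 + REMOTE_SURCHARGE_PERCENT) // 100


if __name__ == "__main__":
    for level, base in BASE_FEES_CNY.items():
        print(f"{level}: base ¥{base}, remote ¥{remote_fee(base)}")
```

Under these assumptions, the remote fees would be ¥260 for the primary level and ¥520 for the advanced level.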

Appraisal of the HSKK
The current review adopts Bachman and Palmer's (2010) Assessment Use Argument (AUA), a framework consisting of a set of claims that connect a test taker's performance (assessment tasks and assessment records) to test use (i.e., interpretations, decisions, and consequences). The framework links all aspects of the HSKK and supports a systematic appraisal of the trustworthiness of the test interpretations and of the multi-purpose use of the HSKK. AUA is adopted because it is one of the few frameworks that includes an argument for assessment use and specifies the connections between test interpretation and the consequences of test use. The following section explains how well the HSKK has performed against the claims in AUA by comprehensively reviewing current empirical and validation studies and technical reports released by HSKK developers and administrators.

Assessment tasks
According to Bachman and Palmer (2010), assessment tasks comprise the input (e.g., a paragraph to read or a picture to describe) and the questions (e.g., open-ended or closed questions) used in a test to elicit test takers' responses. As the key element of AUA and the data for relevant claims related to test design and use, the assessment tasks and their features, such as structures and authenticity, are worthy of analysis and attention.
Structures. HSKK assesses L2 Chinese learners' speaking skills in three parts across three proficiency levels (see Table 1).
Table 1 shows that the HSKK has six different task types with three tasks at each level. Part I in the primary-level HSKK is a "Listen and repeat" task requiring test takers to repeat 15 sentences precisely. The length and difficulty of the ten sentences at the intermediate-level "Listen and repeat" task increase, which makes the sentences more challenging to remember over a short period of time. At the advanced level, the "Listen and repeat" task is replaced with a "Listen and retell" task that requires test takers to accurately retell the essential information of three short paragraphs, which are usually selected from either narrative, argumentative, or descriptive texts. Sentences in each paragraph at the advanced level are longer and more complex with more information for test takers to remember.
Part II of the primary-level HSKK is a "Listen and reply" task that requires test takers to listen and answer ten questions. Three types of questions often appear in this task: general or Yes/No questions, special or wh-questions, and choice questions. At the intermediate and advanced levels, the "Listen and reply" task is replaced with the "Describe pictures" and "Read a paragraph" tasks, respectively. Intermediate-level test takers are required to describe two stories coherently based on two pictures without being provided any voice or text materials, while advanced-level test takers are required to read an excerpt from prose, which mainly assesses their pronunciation, intonation, and recognition of Chinese characters.
Part III at all three levels is an "Answer questions" task requiring test takers to read two open-ended questions and use at least five sentences to answer each question. "Answer questions" tasks in the primary- and intermediate-level HSKK are annotated with pinyin (the official romanization system for Chinese characters in Mainland China) to help students read and understand Chinese characters they cannot recognize. The primary-level "Answer questions" task and the first question of the intermediate-level "Answer questions" task often require test takers to describe people, places, experiences, or habits. The two questions of the advanced-level "Answer questions" task and the second question of the intermediate-level "Answer questions" task usually require test takers to give or evaluate an opinion on a certain topic, describe things that involve a hypothetical situation, or explain differences and/or similarities between two things. Test takers at all three levels need to describe or narrate things or events using correct grammar and appropriate words. Intermediate- and advanced-level test takers also need to include advice and express their opinions accurately and logically. Some preparation time is given for the task at all three levels.
Task authenticity. Task authenticity is important when assessing speaking skills as it concerns the relationship between the speaking test and real-world contexts (Luoma, 2004). Hanban (2010) claims that the test developers have tried to incorporate authentic topics familiar to test takers into the HSKK. Nevertheless, the authenticity of the "Listen and repeat" tasks in the primary- and intermediate-level HSKK has been criticized (Jin, 2019). Studies evaluating past HSKK tests have pointed to inauthentic sentences that are largely devoid of context (see Jin, 2019; Wang, 2014, 2018). However, some researchers (e.g., Wang & Jiang, 2017; Yan et al., 2016) argued that second language learners usually draw on the spoken language used by their interlocutors and summarize, or even repeat, statements when preparing or giving a response within a conversation. Accordingly, the test is valid to some extent because natural conversation and interaction depend in part on repetition, although the authenticity of the "Listen and repeat" task appears lacking compared with other tasks (e.g., "Listen and reply" and "Answer questions"). Jin (2019) also claimed that the texts selected in the advanced-level "Read a paragraph" task are excerpts from prose containing formal written language that reflects real-life oral communication to only a very limited extent.

Assessment records
The HSKK records the scores test takers achieve after completing the assessment tasks.
Scoring. In the HSKK, the maximum score that test takers can achieve at all three levels is 100, with 60 as the passing threshold (Hanban, 2010). While the test score itself has no expiration date, some higher education institutes in China may require an HSKK result no more than 2 years old. Test takers receive only a total score (e.g., 90 out of 100); section scores are not reported. This lack of transparency has led some test takers to request their individual task scores to see which parts can be improved (Ding et al., 2021). Furthermore, Hanban (2010) does not explain why the passing score is set at 60; curiously, no study has analyzed how this number was decided. Similarly, more information about the setting, monitoring, and validation of the grading process should be provided by the test developer.
Regarding the scoring criteria for the HSKK, Hanban (2010) provides a task-specific rubric (Table 2); however, no information has been provided about how scores are allocated to the three tasks at each level, and no study appears to have validated the design of the rubric. Test takers therefore do not know how much each task contributes to the total score. The test developer should thus consider providing a more detailed grading scale that totals 100 points.
Reliability. Regarding the reliability of the HSKK, an empirical study conducted by Cui (2010) examined the intermediate-level HSKK between 2008 and 2010. Cui (2010) invited three trained examiners to rate 51 tests and applied generalizability theory procedures to analyze the generalizability coefficient (reliability) of the whole test and every task type, and then applied Spearman's rank correlation and a paired-samples t-test to analyze the relation between the "Describe pictures" and "Answer questions" tasks.
The results suggested that the reliability (.87) was acceptable for the whole test, while the "Describe pictures" task had a higher reliability (.88) than "Answer questions" (.86), and combining scores on the different tasks into a composite score was reasonable.
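For readers unfamiliar with generalizability theory, the meaning of such a coefficient can be illustrated under a simplifying assumption. The exact design of Cui's (2010) analysis is not described here; the expression below assumes a fully crossed persons-by-raters design in which each of n_r raters scores every test taker, so the symbols are illustrative rather than taken from the study.

```latex
E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{pr,e} / n_r}
```

Here \sigma^2_{p} is the variance attributable to true differences among test takers, \sigma^2_{pr,e} is the person-by-rater interaction confounded with residual error, and n_r is the number of raters (three in Cui's study). Under these assumptions, a coefficient of .87 would indicate that roughly 87% of the expected observed-score variance reflects differences among test takers rather than rater-related error.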
A more recent study conducted by Ding et al. (2021) investigated the consistency of the test scores at all three levels of the HSKK between the test formats (i.e., paper-based or internet-based) and areas (i.e., whether taken inside or outside of China) using an independent-samples t-test. The effect sizes of the primary- (Cohen's d = .37), intermediate- (Cohen's d = .18), and advanced-level (Cohen's d = .15) HSKK were small, indicating that the format had a negligible impact on scoring consistency. However, when examining the effect of area on scoring consistency, they found that the effect sizes of the primary- (Cohen's d = .47) and intermediate-level HSKK (Cohen's d = .68) were close to or above the medium range. Those taking the primary- and intermediate-level HSKK inside China achieved higher average scores than those outside China. Ding et al. (2021) speculated that the difference was caused not by the location of the test but by the students' language environment; that is, the test takers in China were more exposed to the target language environment. Studies have shown that the target language environment can provide more language input and communication opportunities, which positively develop learners' linguistic competence (Collentine & Freed, 2004). Another possible explanation is that the language learning motivation of test takers improved after living in China (Ding, 2015). However, the effect size comparing scores of the advanced-level HSKK (Cohen's d = .15) was small, indicating that the area/environment factors had little influence on the advanced-level test takers.
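To make the effect-size comparisons above concrete, the following is a minimal sketch of how Cohen's d for two independent groups can be computed with a pooled standard deviation. The function names and the toy scores are hypothetical and are not data from Ding et al. (2021); the labeling bands simplify Cohen's conventional benchmarks (.5 for medium, .8 for large) and treat everything below .5 as small, which is an assumption made for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def describe_effect(d):
    """Label an effect size with simplified bands (an assumption, see lead-in)."""
    size = abs(d)
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Hypothetical toy scores, not HSKK data.
    inside_china = [2, 4, 6]
    outside_china = [1, 3, 5]
    d = cohens_d(inside_china, outside_china)
    print(f"d = {d:.2f} ({describe_effect(d)})")
    # Effect sizes reported in the text for the area factor.
    for level, reported in [("primary", 0.47), ("intermediate", 0.68), ("advanced", 0.15)]:
        print(level, reported, describe_effect(reported))
```

With these simplified bands, the reported area effects of .47 and .15 fall in the small band and .68 falls in the medium band, which is consistent with the description of .47 as close to the medium range.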
In sum, study findings suggest the HSKK has good consistency among question items, the rubric, and formats; however, scoring consistency regarding other factors such as region and gender should be further explored.

Interpretations
A general claim in language testing and assessment is that "the interpretations of the ability assessed on a test should be meaningful, impartial, generalizable, relevant, and sufficient" (Yao & Wallace, 2021, p. 1). Validity, then, relates to whether the test interpretations are meaningful and significant (Fan & Yan, 2020; Knoch & Chapelle, 2018). This section reviews a key element in AUA, the construct validity of the HSKK, which pertains to the validity of the interpretations drawn from the assessment records (Bachman & Palmer, 2010). It also presents the ability descriptions and vocabulary requirements that the HSKK uses to interpret the Chinese speaking skills of test takers who pass a given proficiency level.

Construct validity. Regarding the construct validity of the interpretations derived from the HSKK assessment records, Hanban (2010) claimed that the key construct the test measures is general Chinese speaking proficiency. Specifically, the primary-level HSKK assesses the ability to comprehend and use everyday language and fulfill the demands of various daily tasks. The intermediate-level HSKK assesses test takers' ability to understand intermediate Chinese and communicate effectively with L1 Chinese speakers. The advanced-level HSKK measures test takers' ability to comprehensively understand oral Chinese and present themselves eloquently with advanced and abundant Chinese expressions.
A study conducted by Jin (2019) systematically examined the construct validity of the HSKK. Item discrimination of the six different task types was analyzed based on descriptive statistics and on complexity, accuracy, and fluency measures, three standard linguistic indicators frequently used to distinguish test takers' speaking proficiency levels (Fan & Yan, 2020). Jin (2019) analyzed 40 test takers' audio records and found that the "Listen and repeat" tasks in the primary- and intermediate-level HSKK and the "Read a paragraph" task in the advanced-level HSKK may fail to distinguish test takers' Chinese speaking proficiency. These findings align with Wang's (2020). The remaining four tasks, however, were found to assess the learners' speaking proficiency appropriately. To address this deficiency, the primary- and intermediate-level "Listen and repeat" task design should be improved to include more frequently spoken Chinese words and sentence structures. Future validation studies can be conducted on parallel forms of the "Listen and repeat" tasks. The "Read a paragraph" task in the advanced-level HSKK should also be revised to include more domain-specific and formal expressions used in oral communication instead of complex written language that lacks authenticity (as discussed in the "Task authenticity" section). Doing so would result in a more authentic and valid three-level HSKK.
One important component of the speaking construct, interactional competence, is underrepresented in the current semi-direct HSKK (i.e., human-machine/paper), which threatens decisions and conclusions based on scores (see Roever & Dai, 2021). In contrast, a direct test, which has the candidate speaking in real time with a trained examiner, is one of the most common modes for assessing oral proficiency (e.g., International English Language Testing System Speaking test, Cambridge English A1-C2 tests, and Business English Certifications). These tests may attempt to mimic a real-life setting and actions as closely as possible while measuring the test takers' oral language ability (and possibly interactional competence) (Qian, 2009; Roever & Dai, 2021). Therefore, test developers should consider building interactional competence into the HSKK by developing a direct testing mode. Inevitably, however, offering only the direct testing mode would significantly increase the cost and extra work borne by the test center, entailing the recruitment, training, and management of HSKK examiners, and the fee for the face-to-face HSKK would also need to be raised accordingly.
In sum, few validation studies have examined the construct validity of the HSKK; thus, more studies carried out on different Chinese learning contexts and speaking proficiency levels are needed in this area.
Levels and abilities. Hanban (2010) argued that the HSKK score is criterion-referenced. The three levels of HSKK have been aligned with several internationally recognized standards, such as the Chinese Language Proficiency Scales (CLPS), the American Council on the Teaching of Foreign Languages (ACTFL), and the Common European Framework of Reference for Languages (CEFR) (see Table 3).
According to criterion-referenced interpretations, if an HSKK test taker achieves 60 or above, the suggested interpretation is that the test taker has reached a minimum standard based on the ability descriptions, whereas test takers who score less than 60 have not. However, some researchers (e.g., Ding, 2015; Jin, 2019; Teng, 2017; Wang & Jiang, 2017) have expressed doubt about measuring against other standards; that is, the test takers' scores on the HSKK may not be an accurate indicator of their speaking skills as rated by other internationally recognized standards. For example, as mentioned in the "Structures" section, the text in the "Read a paragraph" task of the advanced-level HSKK, which lacks authenticity, is taken from prose that contains formal written language with no conversation or interaction involved (Jin, 2019). In contrast, at the corresponding CEFR proficient level (C1/C2), a critical criterion of spoken language use, namely test takers' oral interaction with the examiner, is included and carefully evaluated according to the guidelines. Specifically, test takers at the C2 level are required to take part effortlessly in any conversation, be fairly familiar with idiomatic expressions and colloquialisms, and backtrack and restructure speech whenever conversation difficulties are encountered (Council of Europe, 2020).
Another concern about measuring against other standards is the limited vocabulary size required in the HSKK, which can lead to misinterpretations of HSKK scores. The vocabulary size requirements of HSKK/HSK and other Chinese proficiency standards are compared in Table 4 based on the CEFR levels (basic-A1/A2, independent-B1/B2, proficient-C1/C2) they are supposedly aligned with (Hanban, 2010). For the primary-level HSKK, 200 words is the minimum criterion, which is far from the 2100-word requirement for primary-level L2 Chinese speakers in the Spoken Chinese Proficiency Grading Standards and Testing Guideline (Ministry of Education [MoE], 2011) and the 2245-word requirement in the newly launched Chinese Proficiency Grading Standards for International Chinese Education (Ministry of Education [MoE], 2021) developed by the MoE. Some test takers have claimed that taking the primary-level HSKK has little value because it requires them to master only 200 words, meaning it is simpler to skip the primary level and go straight to the intermediate level (Ding et al., 2021). This mismatch of vocabulary requirements can also be found in the intermediate- and advanced-level HSKK. Thus, the HSKK needs to be better aligned with other speaking proficiency standards, such as ACTFL, CEFR, CLPS, and the Chinese Proficiency Grading Standards for L2 Chinese learners.

Decisions
According to AUA principles (Bachman & Palmer, 2010), score-based decisions, which presume and build on sound score-based interpretations, can be made by considering the existing values in the community and relevant legal requirements. A general claim is that test scores and other test-related information allow test users to make relevant, helpful, and sufficient decisions without any adverse consequences due to the assessment process. Hanban (2010) indicated that the HSKK has been specially developed to assess L2 Chinese learners' general speaking skills to inform and support the score-based decision-making needs of higher education institutes and employers, L2 Chinese learners, and Chinese training institutes.
Regarding decision-making on using the HSKK in academic contexts, in 2018, the MoE specified that the HSK and HSKK scores are recognized as language requirements for admission to Chinese higher education institutes. Specifically, undergraduate and graduate students enrolled in Chinese-taught programs must achieve CLPS Level V (equivalent to advanced-level HSKK or HSK Level 5) before completing their second undergraduate year or before graduation (for graduate students). Some higher education institutes in China also admit international students to English Medium Instruction (EMI) programs. EMI students must achieve CLPS Level III (intermediate-level HSKK or HSK Level 3) before graduation, and EMI medicine majors must achieve CLPS Level IV (intermediate-level HSKK or HSK Level 4) before the practicum. Although the MoE documents suggest using the HSKK results in higher education settings because speaking skills are important for students to live and study in a second language environment, most students choose to satisfy the language requirement by taking the HSK since it has been set as a compulsory proficiency requirement for admission by most higher education institutes in China (Wang, 2018). HSK Level 4 is now becoming a globally acknowledged proficiency test for international Chinese learners (Ding et al., 2021; Wang, 2018). Nevertheless, for L2 students who wish to pursue a government-funded Chinese-related program (e.g., Chinese language education, Chinese literature, Chinese history, and Chinese philosophy), the Confucius Institute Headquarters' document states that the HSKK is a compulsory component of the application (Ding et al., 2021). Unfortunately, apart from being used to apply for a few government scholarships, scores on the HSKK do not appear to have wide public credibility, and some Chinese learners do not seem aware of the test (Wang, 2014; Yuan, 2017), perhaps because the HSK is a more comprehensive proficiency exam that assesses three skills (i.e., writing, reading, and listening) and is widely recognized by most higher education institutes in China (Wang, 2018). Thus, higher education institutes in China should consider setting the HSKK as a reference or compulsory test before admitting and funding international students, given that the HSK and HSKK measure different skills and the HSKK provides an official reference for gauging international students' general speaking proficiency (Ding et al., 2021; Wang, 2014).
As for employers, training institutes, and learners, whether and how to use the HSKK for specific decision-making purposes largely depends on the needs of the company, language center, and learner. Unfortunately, few studies have collected and analyzed test users' needs and how scores are used for making decisions. Thus, eliciting test users' perceptions, needs, and decision-making processes using HSKK scores is an area for future research.

Consequences
Using the AUA can provide test users with a rich lens for understanding both intended and unintended consequences (Bachman & Palmer, 2010).
The HSKK has two fundamental purposes, which are listed under "Test purpose" earlier in this review. The first purpose is to promote spoken Chinese teaching and learning domestically and internationally, although this purpose has yet to be well achieved (Ding et al., 2021). Among the 437,331 test takers who took the HSK and HSKK in 2018, only 30,407 took the HSKK, accounting for less than 7%. Compared with the HSK, which has a long history of research and development, the HSKK is still in its developmental stages (Wang, 2014, 2018). The relatively small number of HSKK test takers may be attributed to its low public credibility; that is, many Chinese learners are not even aware of its existence (Wang, 2014; Yuan, 2017). Meanwhile, the number of HSKK test takers has increased both inside and outside of China (Ding et al., 2021). This growth, however, may have been tempered by the above-discussed concern about the HSKK's low vocabulary requirement.
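The proportion of HSKK test takers reported for 2018 can be checked with a short calculation. The figures are those cited in the text; the variable and function names are illustrative.

```python
# Figures for 2018 as cited in the text (Ding et al., 2021).
TOTAL_HSK_AND_HSKK_TAKERS = 437_331
HSKK_TAKERS = 30_407


def hskk_share_percent(hskk_takers: int, total_takers: int) -> float:
    """Return HSKK test takers as a percentage of all HSK and HSKK test takers."""
    return 100 * hskk_takers / total_takers


if __name__ == "__main__":
    share = hskk_share_percent(HSKK_TAKERS, TOTAL_HSK_AND_HSKK_TAKERS)
    print(f"HSKK share of test takers: {share:.2f}%")
```

The result is just under 7%, which agrees with the figure stated in the text.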
The second purpose of the HSKK is for test users to make various score-based decisions concerning studying and working in China and learning and teaching the Chinese language. These score-based decisions have attracted test users' attention in the past decade. However, regarding the recruitment of international students, Wang (2018) claimed that the HSKK had become a less important measure in China. Nevertheless, international students applying for Confucius Institute scholarships to study in Chinese-related programs must supply an HSKK score (Ding et al., 2021). Some educational institutes in China also recommend using both the HSK and the HSKK as a comprehensive record of students' listening, reading, writing, and speaking skills (Ding et al., 2021). However, studies have also questioned the appropriateness of using the HSKK in academic contexts owing to its focus on general speaking proficiency rather than academically oriented Chinese (Hanban, 2010). Notably, several studies (e.g., Ding et al., 2021; Peng & Yan, 2019; Wang & Jiang, 2017) have revealed that international students, even those who have passed the advanced-level HSKK, face difficulties learning Chinese for academic purposes. For example, international students have trouble using Chinese academic words (Peng & Yan, 2019) and using appropriate Chinese to present their research findings at academic conferences (Ding et al., 2021; Wang & Jiang, 2017). Thus, because HSKK scores concern only general speaking skills, a separate proficiency test focusing on spoken academic Chinese appears to be needed for L2 students who want to pursue higher education in China.
Regarding the use of HSKK scores for hiring decisions by Chinese companies, Wang (2018) conducted an exploratory study and found that most companies require neither the HSK nor the HSKK; the HSKK does not appear to have public credibility even though Hanban (2010) argued that one of the test's purposes is to help companies assess international workers' speaking proficiency. Wang (2018) claimed that some companies in China prefer to conduct internal interviews to assess prospective international workers' Chinese vocabulary and communication ability in a business context because they do not have confidence that the HSKK, as a general Chinese speaking test, can accurately reflect international workers' Chinese ability in a business setting. Another test, the Business Chinese Test (BCT) (Oral iBT), may be a better indicator of spoken language competence than the HSKK for employers. According to the BCT administrator's brief description (see www.chinesetest.cn), test takers who pass the advanced-level BCT (Oral iBT) can fully understand and comprehensively use Chinese in authentic and diverse business contexts. However, a better articulation of the different uses of HSKK and BCT (Oral iBT) is needed to help test users understand the differences between the two speaking tests' purposes in general and professional contexts.
As for using the HSKK to improve the learning and teaching of Chinese speaking skills, Wang (2014, 2018), who examined the washback effects of the HSKK in China, found that it did not seem to have an impact on Chinese teachers' language teaching beliefs or their teaching, but it could motivate international students to practice their spoken Chinese and pursue higher education in China. Thus, Wang (2018) concluded that the washback effect of the HSKK on Chinese speaking skills dovetails with Hanban's (2010) intended consequences. However, it is uncertain whether improved teaching of Chinese has been achieved, as only a few studies have been conducted on the washback effects of the HSKK, and these have been conducted only at the college level in China. Thus, future investigations should examine the HSKK's washback effect across all levels and learning contexts.

Table 1. An overview of the HSKK tasks.

Table 3. Mapping of the HSKK levels.

Table 4. Comparison of the vocabulary size among HSKK/HSK and Chinese proficiency grading standards.