
Journal News | SSCI Journal Language Testing, 2024, Issues 1-2

2024-09-29

LANGUAGE TESTING

Volume 41, Issue 1-2, 2024

LANGUAGE TESTING (SSCI Q1; 2023 impact factor: 2.2; rank: 33/194) published 20 pieces in its 2024 Issues 1-2: 18 research articles and 2 book reviews. The research articles cover topics such as the relationship between written discourse features and integrated listening-to-write scores, diagnostic assessment of foreign-language reading and spelling, vocabulary knowledge in relation to reading proficiency, the Duolingo English Test, and accent familiarity. Please feel free to share this post!

Previous issues:

Journal News | SSCI Journal Language Testing, 2023, Issues 3-4

Journal News | SSCI Journal Language Testing, 2023, Issues 1-2

Contents


ARTICLES

Issue 1

■ Purposeful turns for more equitable and transparent publishing in language testing and assessment, by Talia Isaacs, Paula M. Winke, Pages 3-8.

■ Assessing speaking through multimodal oral presentations: The case of construct underrepresentation in EAP contexts, by Louise Palmour, Pages 9-34.

■ The relationship between written discourse features and integrated listening-to-write scores for adolescent English language learners, by Ray J. T. Liao, Renka Ohta, Kwangmin Lee, Pages 35-59.

■ English foreign language reading and spelling diagnostic assessments informing teaching and learning of young learners, by Janina Kahn-Horwitz and Zahava Goldstein, Pages 60-88.

■ Establishing meaning recall and meaning recognition vocabulary knowledge as distinct psychometric constructs in relation to reading proficiency, by Jeffrey Stewart, Henrik Gyllstad, Christopher Nicklin, Stuart McLean, Pages 89-108.

■ Diagnosing Chinese EFL learners’ writing ability using polytomous cognitive diagnostic models, by Xiaoting Shi, Xiaomei Ma, Wenbo Du, Xuliang Gao, Pages 109-134.

■ Development of the American Sign Language Fingerspelling and Numbers Comprehension Test (ASL FaN-CT), by Corrine Occhino, Ryan Lidster, Leah C. Geer, Jason Listman, Peter C. Hauser, Pages 135-161.

■ Critical discursive approaches to evaluating policy-driven testing: Social impact as a target for validation, by Dongil Shin, Pages 162-180.

■ Rethinking student placement to enhance efficiency and student agency, by Beverly Baker, Angel Arias, Louis-David Bibeau, Yiwei (Coral) Qin, Margret Norenberg, Jennifer St-John, Pages 181-191.

■ Practical considerations when building concordances between English tests, by Ramsey L. Cardwell, Steven W. Nydick, J. R. Lockwood, Alina A. von Davier, Pages 192-202.

■ Our validity looks like justice. Does yours? by Jennifer Randall, Mya Poe, David Slomp, Maria Elena Oliveri, Pages 203-219.


Issue 2

■ Speaking performances, stakeholder perceptions, and test scores: Extrapolating from the Duolingo English test to the university, by Daniel R. Isbell, Dustin Crowther, Hitoshi Nishizawa, Pages 233-262.

■ Fairness of using different English accents: The effect of shared L1s in listening tasks of the Duolingo English test, by Okim Kang, Xun Yan, Maria Kostromitina, Ron Thomson, and Talia Isaacs, Pages 263-289.

■ Revisiting raters’ accent familiarity in speaking tests: Evidence that presentation mode interacts with accent familiarity to variably affect comprehensibility ratings, by Michael D. Carey, Stefan Szocs, Pages 290-315.

■ Assessing the content quality of essays in content and language integrated learning: Exploring the construct from subject specialists’ perspectives, by Takanori Sato, Pages 316-337.

■ Language testers and their place in the policy web, by Laura Schildt, Bart Deygers, Albert Weideman, Pages 338-356.

■ Comparing two formats of data-driven rating scales for classroom assessment of pragmatic performance with roleplays, by Yunwen Su, Sun-Young Shin, Pages 357-383.

■ Triangulating natural language processing (NLP)-based analysis of rater comments and many-facet Rasch measurement (MFRM): An innovative approach to investigating raters’ application of rating scales in writing assessment, by Huiying Cai, Xun Yan, Pages 384-411.

■ The development of a Chinese vocabulary proficiency test (CVPT) for learners of Chinese as a second/foreign language, by Haiwei Zhang, Peng Sun, Winda Widiawati, Pages 412-442.

■ Making each point count: Revising a local adaptation of the Jacobs et al.’s (1981) ESL COMPOSITION PROFILE rubrics, by Yu-Tzu Chang, Ann Tai Choe, Daniel Holden, Daniel R. Isbell, Pages 443-455.

Abstracts


Purposeful turns for more equitable and transparent publishing in language testing and assessment

Talia Isaacs, University College London, UK

Paula M. Winke, Michigan State University, USA

Abstract This Editorial comes at a time when the after-effects of the acute phase of the COVID-19 pandemic are still being felt but when, in most countries around the world, there has been some easing of restrictions and a return to (quasi-)normalcy. In the language testing and assessment community, many colleagues relished the opportunity to meet and participate in events at the 44th annual Language Testing Research Colloquium (LTRC) in New York in July 2023. This was after 4 years of LTRC exclusively being held online due to public health concerns, restrictions on movement, and other policy-related and logistical matters. In the context of this Editorial, which comes out annually, we find it liberating to be able to focus on matters that are non-pandemic related. In terms of the day-to-day business of managing the journal, we have moved beyond a time of crisis, as reflected in the removal of a note about pandemic effects in our Author and Reviewer e-mail invitation templates. In this annual address, we note a change of the guard in the editorial team that will have come into effect by the time this Editorial is published and some elements of continuity. We also reflect on developments over the past year while briefly touching on what lies ahead.


Assessing speaking through multimodal oral presentations: The case of construct underrepresentation in EAP contexts

Louise Palmour, University of Southampton, UK; University of Glasgow, UK.


Abstract This article explores the nature of the construct underlying classroom-based English for academic purposes (EAP) oral presentation assessments, which are used, in part, to determine admission to programmes of study at UK universities. Through analysis of qualitative data (from questionnaires, interviews, rating discussions, and fieldnotes), the article highlights how, in EAP settings, there is a tendency for the rating criteria and EAP teacher assessors to sometimes focus too narrowly on particular spoken linguistic aspects of oral presentations. This is in spite of student assessees drawing on, and teacher assessors valuing, the multimodal communicative affordances available in oral presentation performances. To better avoid such construct underrepresentation, the multimodal dimensions of oral presentation performances should be acknowledged and represented in rating scales, teacher assessor decision-making, and training in EAP contexts.


Key words Construct validity, English for academic purposes, multimodality, oral presentation, qualitative research



The relationship between written discourse features and integrated listening-to-write scores for adolescent English language learners

Ray J. T. Liao, National Taiwan Ocean University, Taiwan.

Kwangmin Lee, University of Iowa, USA.

 

Abstract As integrated writing tasks in large-scale and classroom-based writing assessments have risen in popularity, research studies have increasingly concentrated on providing validity evidence. Given the fact that most of these studies focus on adult second language learners rather than younger ones, this study examined the relationship between written discourse features, vocabulary support, and integrated listening-to-write scores for adolescent English learners. The participants of this study consisted of 198 Taiwanese high school students who completed two integrated listening-to-write tasks. Prior to each writing task, a list of key vocabulary was provided to aid the students’ comprehension of the listening passage. Their written products were coded and analyzed for measures of discourse features and vocabulary use, including complexity, accuracy, fluency, organization, vocabulary use ratio, and vocabulary use accuracy. We then adopted descriptive statistics and hierarchical linear regression analyses to investigate the extent to which such measures were predictive of integrated listening-to-write test scores. The results showed that fluency, organization, grammatical accuracy, and vocabulary use accuracy were significant predictors of the writing test scores. Moreover, the results revealed that providing vocabulary support may not necessarily jeopardize the validity of integrated listening-to-write tasks. The implications for research and test development were also discussed.


Key words Academic writing, adolescent English language learners, discourse features, L2 integrated writing assessment, text analysis
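A hedged sketch of the kind of blockwise (hierarchical) regression described above is given below. The data frame, predictor names, and coefficients are all invented for illustration; this is not the authors' dataset or exact model specification.

```python
# Hypothetical sketch of a blockwise (hierarchical) regression loosely
# mirroring the analysis described above; all variable names and data
# are invented for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 198  # sample size reported in the abstract
df = pd.DataFrame({
    "fluency": rng.normal(size=n),
    "accuracy": rng.normal(size=n),
    "organization": rng.normal(size=n),
    "vocab_accuracy": rng.normal(size=n),
})
# Simulated listening-to-write score driven by some of the predictors.
df["score"] = (0.5 * df["fluency"] + 0.4 * df["organization"]
               + 0.3 * df["accuracy"] + rng.normal(size=n))

# Block 1: complexity/accuracy/fluency-type measures only.
m1 = smf.ols("score ~ fluency + accuracy", data=df).fit()
# Block 2: add organization and vocabulary-use accuracy.
m2 = smf.ols("score ~ fluency + accuracy + organization + vocab_accuracy",
             data=df).fit()

print(f"R2 block 1 = {m1.rsquared:.3f}, R2 block 2 = {m2.rsquared:.3f}")
print(anova_lm(m1, m2))  # F-test for the incremental variance explained
```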


English foreign language reading and spelling diagnostic assessments informing teaching and learning of young learners

Janina Kahn-Horwitz, Oranim College, Israel.

Zahava Goldstein, University of Haifa, Israel.



Abstract In order to inform English foreign language (EFL) diagnostic assessment of literacy, this study examined the extent to which 175 first-language Hebrew-speaking EFL young learners from fifth to tenth grade exhibited differences in single-letter grapheme recognition, sub-word, and word reading, and rapid automatized naming (RAN) of letters and numbers. In addition, this cross-sectional quasi-experimental quantitative study examined correlations between the aforementioned literacy components and oral reading speed, spelling, vocabulary, syntax, and morphological awareness. There were no differences between the grades for single-letter grapheme recognition, and participants demonstrated incomplete automatic recognition for this task. Sub-word recognition improved across grades. However, the results highlighted a lack of mastery. Sub-word recognition correlated with word reading and spelling throughout. RAN speeded measures and oral reading speed correlated with sub-word, word recognition, and spelling in the older grades illustrating the presence of accuracy and speed components. Correlations across grades between literacy components and vocabulary, syntax, and morphological awareness provided support for theories explaining how knowledge of multiple layers of words contributes to literacy acquisition. These results comprising EFL diagnostic assessment can inform reading and spelling teaching and learning.


Key words Diagnostic assessment, English as a Foreign Language, reading, spelling, struggling students, young learners



Establishing meaning recall and meaning recognition vocabulary knowledge as distinct psychometric constructs in relation to reading proficiency

Jeffrey Stewart, Tokyo University of Science, Japan.

Henrik Gyllstad, Lund University, Sweden.


Abstract The purpose of this paper is to (a) establish whether meaning recall and meaning recognition item formats test psychometrically distinct constructs of vocabulary knowledge which measure separate skills, and, if so, (b) determine whether each construct possesses unique properties predictive of L2 reading proficiency. Factor analyses and hierarchical regression were conducted on results derived from the two vocabulary item formats in order to test this hypothesis. The results indicated that although the two-factor model had better fit and meaning recall and meaning recognition can be considered distinct psychometrically, discriminant validity between the two factors is questionable. In hierarchical regression models, meaning recognition knowledge did not make a statistically significant contribution to explaining reading proficiency over meaning recall knowledge. However, when the roles were reversed, meaning recall did make a significant contribution to the model beyond the variance explained by meaning recognition alone. The results suggest that meaning recognition does not tap into unique aspects of vocabulary knowledge and provide empirical support for meaning recall as a superior predictor of reading proficiency for research purposes.


Key words Meaning recall, meaning recognition, reading, TOEIC, vocabulary testing
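One quick, back-of-the-envelope way to probe the discriminant-validity concern raised above is to correct the observed correlation between the two item formats for their unreliability; a disattenuated correlation approaching 1.0 suggests the two constructs are hard to separate. The numbers below are invented, and this is not the factor-analytic procedure used in the study.

```python
# Disattenuated correlation between meaning-recall and meaning-recognition
# scores: r_true = r_observed / sqrt(rel_recall * rel_recognition).
# All numbers are invented for illustration.
import math

r_observed = 0.80        # hypothetical observed correlation between formats
rel_recall = 0.90        # hypothetical reliability of the recall test
rel_recognition = 0.85   # hypothetical reliability of the recognition test

r_disattenuated = r_observed / math.sqrt(rel_recall * rel_recognition)
print(f"correlation corrected for attenuation: {r_disattenuated:.2f}")
# Values approaching 1.0 would cast doubt on discriminant validity,
# in line with the concern noted in the abstract.
```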



Diagnosing Chinese EFL learners’ writing ability using polytomous cognitive diagnostic models 

Xiaoting Shi, Xi’an Jiaotong University, China.

Xiaomei Ma, Xi’an Jiaotong University, China.


Abstract Cognitive diagnostic assessment (CDA) intends to identify learners’ strengths and weaknesses in latent cognitive attributes to provide personalized remedial instruction. Previous CDA studies on English as a Foreign Language (EFL)/English as a Second Language (ESL) writing have adopted dichotomous cognitive diagnostic models (CDMs) to analyze data from checklists using simple yes/no judgments. Compared to descriptors with multiple levels, descriptors with only yes/no judgments were considered too absolute, potentially resulting in misjudgment of learners’ writing ability. However, few studies have used polytomous CDMs to analyze graded response data from rating scales to diagnose writing ability. This study applied polytomous CDMs to diagnose 1166 EFL learners’ writing performance scored with a three-level rating scale. The sG-DINA model was selected after comparing model-data fit statistics of multiple polytomous CDMs. The results of classification accuracy indices and item discrimination indices further demonstrated that the sG-DINA model performed well in identifying learners’ strengths and weaknesses. The generated diagnostic information at group and individual levels was further synthesized into a personalized diagnostic report, although its usefulness still requires further investigation. The findings provided evidence for the feasibility of applying polytomous CDMs in EFL writing assessment.


Key words Cognitive diagnostic assessment, polytomous cognitive diagnostic models, multiple polytomous cognitive diagnostic model comparison, EFL writing, diagnostic report
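For readers new to cognitive diagnostic models, the dichotomous DINA model that the abstract contrasts with polytomous models can be written in a few lines. The sketch below uses an invented Q-matrix and slip/guess parameters to show how an examinee's attribute profile determines the probability of a correct yes/no response; polytomous CDMs such as the sG-DINA extend this idea to graded responses.

```python
# Minimal illustration of the dichotomous DINA item response function:
#   P(X_ij = 1 | alpha_i) = (1 - s_j)^eta_ij * g_j^(1 - eta_ij),
# where eta_ij = 1 iff examinee i has mastered every attribute item j requires.
# The Q-matrix, slip, and guess values are invented for illustration.
import numpy as np

q_matrix = np.array([[1, 1, 0],    # item 1 requires attributes 1 and 2
                     [0, 1, 1]])   # item 2 requires attributes 2 and 3
slip = np.array([0.10, 0.15])      # s_j: probability of slipping despite mastery
guess = np.array([0.20, 0.25])     # g_j: probability of guessing without mastery

alpha = np.array([1, 1, 0])        # examinee masters attributes 1 and 2 only

eta = np.all(alpha >= q_matrix, axis=1).astype(float)  # 1 if all required attributes mastered
p_correct = (1 - slip) ** eta * guess ** (1 - eta)
print(p_correct)  # item 1: 0.90 (requirements met), item 2: 0.25 (guessing)
```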


Development of the American Sign Language Fingerspelling and Numbers Comprehension Test (ASL FaN-CT)

Corrine Occhino, Syracuse University, USA.

Ryan Lidster, Indiana University Bloomington, USA.

Abstract We describe the development and initial validation of the “ASL Fingerspelling and Number Comprehension Test” (ASL FaN-CT), a test of recognition proficiency for fingerspelled words in American Sign Language (ASL). Despite the relative frequency of fingerspelling in ASL discourse, learners commonly struggle to produce and perceive fingerspelling more than they do other facets of ASL. However, assessments of fingerspelling knowledge are highly underrepresented in the testing literature for signed languages. After first describing the construct, we describe test development, piloting, revisions, and evaluate the strength of the test’s validity argument vis-à-vis its intended interpretation and use as a screening instrument for current and future employees. The results of a pilot on 79 ASL learners provide strong evidence that the revised test is performing as intended and can be used to make accurate decisions about ASL learners’ proficiency in fingerspelling recognition. We conclude by describing the item properties observed in our current test, and our plans for continued validation and analysis with respect to a battery of tests of ASL proficiency currently in development.


Key words American Sign Language (ASL), ASL vocabulary, assessment, comprehension, fingerspelling, receptive test
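The item properties mentioned above are typically summarized with indices such as item facility and discrimination. The sketch below computes classical facility and corrected point-biserial discrimination on a randomly generated 0/1 response matrix; it is a generic illustration, not the authors' analysis.

```python
# Classical test theory item analysis on a persons-by-items 0/1 matrix.
# Responses are randomly generated, so discrimination values will hover
# near zero here; real pilot data would typically show positive values.
import numpy as np

rng = np.random.default_rng(1)
responses = (rng.random((79, 20)) < 0.6).astype(int)  # 79 pilot takers, 20 items

facility = responses.mean(axis=0)   # proportion correct per item

# Corrected item-total (point-biserial) discrimination: correlate each
# item with the total score computed from the remaining items.
total = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

for j, (p, r) in enumerate(zip(facility, discrimination), start=1):
    print(f"item {j:2d}: facility = {p:.2f}, discrimination = {r:.2f}")
```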




 Critical discursive approaches to evaluating policy-driven testing: Social impact as a target for validation

Dongil Shin, Chung-Ang University, Korea.


Abstract This paper addresses the intersection of testing and policy, situating test-driven impact and validation within the context of policy-led educational reform in Korea. I will briefly review the existing validation models. Then, arguing for an expansion of the conventional conceptualization of consequential validity research, I use Fairclough’s dialectic–relational approach in critical discourse analysis (CDA), positioned in the critical and poststructuralist research tradition, to evaluate social realities, such as the intended and actual impact of policy-led testing. I take, as an example, the context of the development of the National English Ability Test (NEAT) in Korea, which had been used as a means of implementing government policies. Combining Messick’s validity framework for consequential evidence, Bachman and Palmer’s argument-based approach to validation (assessment use argument, AUA), and Fairclough’s dialectic–relational approach, I will illustrate how the impact of policy-led testing is performed and interpreted as a sociopolitical and discursive phenomenon, constituted and enacted in and through “discourse.” By revisiting the previous Faircloughian research works on NEAT’s impact, I postulate that the discourses arguing for and against social impact acquire their meanings from dialectical standpoints.


Key words Argument-based approach to validation, critical discourse analysis, critical discursive approach, critical language testing, discursive turn in language testing, Fairclough’s approach, NEAT (National English Ability Test), policy-driven testing, social impact, sociopolitical dimension of language testing


Rethinking student placement to enhance efficiency and student agency

Beverly Baker, University of Ottawa, Canada.

Angel Arias, Carleton University, Canada.


Abstract Placement tests are used to support a particular need in a local context—to determine the best starting place for a student entering a specific programme of language study. This brief report will focus on the development of an innovative placement test with self-directed elements for our local needs at a university in Canada for students studying English or French as a second language. Our goals are to produce a more efficient assessment instrument while allowing students more agency through the process. We hope that sharing these details will encourage others to consider the potential of incorporating self-directed elements in low-stakes placement decision-making.

Résumé (French) Les tests de classement sont utilisés pour répondre aux besoins spécifiques d’un contexte local. Ils servent à déterminer le meilleur point de départ pour chaque étudiante et étudiant qui commence un programme spécifique en langue seconde. Ce bref rapport se concentre sur le développement d’un test d’évaluation innovant avec des éléments de classement autodirigés pour nos besoins locaux dans une université au Canada s’adressant aux étudiantes et aux étudiants qui souhaitent apprendre l’anglais ou le français comme langue seconde. Notre objectif est de mettre au point un outil d’évaluation plus efficace, qui laisse aussi aux étudiantes et aux étudiants une plus grande marge de manœuvre dans le processus de classement. Nous espérons que ce partage encouragera d’autres personnes à considérer l’intégration d’éléments de classement autodirigés dans la prise de décision pour les classements à faibles enjeux.


Key words Directed self-placement, ESL testing, local language test development, placement assessment, test taker judgements




Practical considerations when building concordances between English tests 

Ramsey L. Cardwell, Duolingo, USA.

Steven W. Nydick, Duolingo, USA.


Abstract Applicants must often demonstrate adequate English proficiency when applying to postsecondary institutions by taking an English language proficiency test, such as the TOEFL iBT, IELTS Academic, or Duolingo English Test (DET). Concordance tables aim to provide equivalent scores across multiple assessments, helping admissions officers to make fair decisions regardless of the test that an applicant took. We present our approaches to addressing practical (i.e., data collection and analysis) challenges in the context of building concordance tables between overall scores from the DET and those from the TOEFL iBT and IELTS Academic tests. We summarize a novel method for combining self-reported and official scores to meet recommended minimum sample sizes for concordance studies. We also evaluate sensitivity of estimated concordances to choices about how to (a) weight the observed data to the target population; (b) define outliers; (c) select appropriate pairs of test scores for repeat test takers; and (d) compute equating functions between pairs of scores. We find that estimated concordance functions are largely robust to different combinations of these choices in the regions of the proficiency distribution most relevant to admissions decisions. We discuss implications of our results for both test users and language testers.


Key words achievement tests, admissions testing, concordance, English proficiency testing, equating, higher education, test validity
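A concordance table of the kind discussed above is often built by equipercentile linking: a score on one test is mapped to the score on the other test that sits at the same percentile rank. The sketch below shows the bare idea on simulated score vectors; the DET (10-160) and TOEFL iBT (0-120) score ranges are the published scales, but the data are invented and the weighting, outlier screening, and smoothing steps described in the abstract are omitted.

```python
# Toy equipercentile concordance between two simulated score distributions.
# Real concordance studies weight the sample to a target population,
# screen outliers, and smooth the equating function; all of that is omitted.
import numpy as np

rng = np.random.default_rng(2)
det = np.clip(rng.normal(110, 20, 2000), 10, 160)    # simulated DET scores
toefl = np.clip(rng.normal(85, 18, 2000), 0, 120)    # simulated TOEFL iBT scores

def equipercentile(x, ref_x, ref_y):
    """Map score x from ref_x's distribution onto ref_y's scale."""
    pct = (ref_x <= x).mean() * 100      # percentile rank of x in ref_x
    return np.percentile(ref_y, pct)     # score at that percentile in ref_y

for det_score in (90, 110, 130):
    print(f"DET {det_score} -> TOEFL iBT ~{equipercentile(det_score, det, toefl):.0f}")
```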


Our validity looks like justice. Does yours?  

Jennifer Randall, University of Michigan, USA.


Abstract Educational assessments, from kindergarten to 12th grade (K-12) to licensure, have a long, well-documented history of oppression and marginalization. In this paper, we (the authors) ask the field of educational assessment/measurement to actively disrupt the White supremacist and racist logics that fuel this marginalization and re-orient itself toward assessment justice. We describe how a justice-oriented, antiracist validity (JAV) approach to validation processes can support assessment justice efforts, specifically with respect to language assessment. Relying on antiracist principles and critical quantitative methodologies, a JAV approach proposes a set of critical questions to consider when gathering validity evidence, with potential utility for language testers.


Key words Antiracist validity, justice-oriented assessment, justice-oriented measurement, justice-oriented validity, language assessment, validity

Speaking performances, stakeholder perceptions, and test scores: Extrapolating from the Duolingo English test to the university

Daniel R. Isbell, University of Hawaiʻi at Mānoa, USA.


Abstract The extrapolation of test scores to a target domain—that is, association between test performances and relevant real-world outcomes—is critical to valid score interpretation and use. This study examined the relationship between Duolingo English Test (DET) speaking scores and university stakeholders’ evaluation of DET speaking performances. A total of 190 university stakeholders (45 faculty members, 39 administrative staff, 53 graduate students, 53 undergraduate students) evaluated the comprehensibility (ease of understanding) and academic acceptability of 100 DET test-takers’ speaking performances. Academic acceptability was judged based on speakers’ suitability for communicative roles in the university context including undergraduate study, group work in courses, graduate study, and teaching. Analyses indicated a large correlation between aggregate measures of comprehensibility and acceptability (r = .98). Acceptability ratings varied according to role: acceptability for teaching was held to a notably higher standard than acceptability for undergraduate study. Stakeholder groups also differed in their ratings, with faculty tending to be more lenient in their ratings of comprehensibility and acceptability than undergraduate students and staff. Finally, both comprehensibility and acceptability measures correlated strongly with speakers’ official DET scores and subscores (r ⩾ .74–.89), providing some support for the extrapolation of DET scores to academic contexts.


Key words Academic acceptability, admissions testing, comprehensibility, Duolingo English Test, extrapolation, linguistic laypersons
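The headline r = .98 above is a correlation between speaker-level aggregates rather than individual ratings. A minimal sketch of that aggregate-then-correlate step, with invented ratings, is shown below.

```python
# Aggregate ratings per speaker, then correlate the speaker-level means.
# Ratings are simulated; in the study, 190 stakeholders rated 100 speakers.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
speakers = np.repeat(np.arange(100), 20)          # 20 ratings per speaker
ability = np.repeat(rng.normal(size=100), 20)     # latent speaker quality
ratings = pd.DataFrame({
    "speaker": speakers,
    "comprehensibility": ability + rng.normal(scale=0.8, size=speakers.size),
    "acceptability": ability + rng.normal(scale=0.8, size=speakers.size),
})

agg = ratings.groupby("speaker")[["comprehensibility", "acceptability"]].mean()
r, p = pearsonr(agg["comprehensibility"], agg["acceptability"])
print(f"speaker-level correlation: r = {r:.2f} (p = {p:.3g})")
```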



Fairness of using different English accents: The effect of shared L1s in listening tasks of the Duolingo English test

Okim Kang, Northern Arizona University, USA.

Xun Yan, University of Illinois Urbana-Champaign, USA.


Abstract This study aimed to answer an ongoing validity question related to the use of nonstandard English accents in international tests of English proficiency and associated issues of test fairness. More specifically, we examined (1) the extent to which different or shared English accents had an impact on listeners’ performances on the Duolingo listening tests and (2) the extent to which different English accents affected listeners’ performances on two different task types. Speakers from four interlanguage English accent varieties (Chinese, Spanish, Indian English [Hindi], and Korean) produced speech samples for “yes/no” vocabulary and dictation Duolingo listening tasks. Listeners who spoke with these same four English accents were then recruited to take the Duolingo listening test items. Results suggested that there is a shared first language (L1) benefit effect overall, with comparable test scores between shared-L1 and inner-circle L1 accents, and no significant differences in listeners’ listening performance scores across highly intelligible accent varieties. No task type effect was found. The findings provide guidance to better understand fairness, equality, and practicality of designing and administering high-stakes English tests targeting a diversity of accents.


Key words Accent varieties, assessing listening, attitudes, Global Englishes, listening tasks



Revisiting raters’ accent familiarity in speaking tests: Evidence that presentation mode interacts with accent familiarity to variably affect comprehensibility ratings

Michael D. Carey, University of the Sunshine Coast, Australia.

Stefan Szocs, University of the Sunshine Coast, Australia.




Assessing the content quality of essays in content and language integrated learning: Exploring the construct from subject specialists’ perspectives


Takanori Sato, Sophia University, Japan.


Abstract Assessing the content of learners’ compositions is a common practice in second language (L2) writing assessment. However, the construct definition of content in L2 writing assessment potentially underrepresents the target competence in content and language integrated learning (CLIL), which aims to foster not only L2 proficiency but also critical thinking skills and subject knowledge. This study aims to conceptualize the construct of content in CLIL by exploring subject specialists’ perspectives on essays’ content quality in a CLIL context. Eleven researchers of English as a lingua franca (ELF) rated the content quality of research-based argumentative essays on ELF submitted in a CLIL course and produced think-aloud protocols. This study explored some essay features that have not been considered relevant in language assessment but are essential in the CLIL context, including the accuracy of the content, presence and quality of research, and presence of elements required in academic essays. Furthermore, the findings of this study confirmed that the components of content often addressed in language assessment (e.g., elaboration and logicality) are pertinent to writing assessment in CLIL. The manner in which subject specialists construe the content quality of essays on their specialized discipline can deepen the current understanding of content in CLIL.


Key words Argumentative writing, CLIL, content, L2 writing, raters’ judgments, think-aloud protocols



Language testers and their place in the policy web 

Laura Schildt, Ghent University, Belgium.

Bart Deygers, Ghent University, Belgium.


Abstract In the context of policy-driven language testing for citizenship, a growing body of research examines the political justifications and ethical implications of language requirements and test use. However, virtually no studies have looked at the role that language testers play in the evolution of language requirements. Critical gaps remain in our understanding of language testers’ firsthand experiences interacting with policymakers and how they perceive the use of tests in public policy. We examined these questions using an exploratory design and semi-structured interviews with 28 test executives representing 25 exam boards in 20 European countries. The interviews were transcribed and double coded in NVivo (weighted kappa = .83) using a priori and inductive coding. We used a horizontal analysis to evaluate responses by participant and a vertical analysis to identify between-case themes. Findings indicate that language testers may benefit from policy literacy to form part of policy webs wherein they can influence instrumental decisions concerning language in migration policy.

Key words Citizenship policies, high-stakes testing, language assessment literacy, language test developer, language testing for migration purposes, policy, policy literacy, score use, test use
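The double-coding check reported above (weighted kappa = .83) can be computed in one call once the two coders' labels are aligned. The sketch below uses scikit-learn's cohen_kappa_score on invented codes and assumes ordered categories with linear weights; the study's actual coding scheme and weighting are not specified here.

```python
# Inter-coder agreement with weighted Cohen's kappa on invented codes.
# Assumes the code categories can be treated as ordered; the study's
# actual coding scheme is not reproduced here.
from sklearn.metrics import cohen_kappa_score

coder_a = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0, 2, 2]
coder_b = [0, 1, 2, 1, 1, 0, 2, 1, 2, 0, 2, 2]

kappa = cohen_kappa_score(coder_a, coder_b, weights="linear")
print(f"weighted kappa = {kappa:.2f}")
```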



Comparing two formats of data-driven rating scales for classroom assessment of pragmatic performance with roleplays


Yunwen Su, University of Illinois Urbana-Champaign, USA.

Sun-Young Shin, Indiana University, USA.


Abstract Rating scales that language testers design should be tailored to the specific test purpose and score use as well as reflect the target construct. Researchers have long argued for the value of data-driven scales for classroom performance assessment, because they are specific to pedagogical tasks and objectives, have rich descriptors to offer useful diagnostic information, and exhibit robust content representativeness and stable measurement properties. This sequential mixed methods study compares two data-driven rating scales with multiple criteria that use different formats for pragmatic performance. They were developed using roleplays performed by 43 second-language learners of Mandarin—the hierarchical-binary (HB) scale, developed through close analysis of performance data, and the multi-trait (MT) scale derived from the HB, which has the same criteria but takes the format of an analytic scale. Results revealed the influence of format, albeit to a limited extent: MT showed a marginal advantage over HB in terms of overall reliability, practicality, and discriminatory power, though measurement properties of the two scales were largely comparable. All raters were positive about the pedagogical value of both scales. This study reveals that rater perceptions of the ease of use and effectiveness of both scales provide further insights into scale functioning.


Key words Data-driven scales, performance assessment, pragmatic competence, rating scale functioning, refusals, roleplay



Triangulating natural language processing (NLP)-based analysis of rater comments and many-facet Rasch measurement (MFRM): An innovative approach to investigating raters’ application of rating scales in writing assessment


Huiying Cai, University of Illinois Urbana-Champaign, USA.

Xun Yan, University of Illinois Urbana-Champaign, USA.


Abstract Rater comments tend to be qualitatively analyzed to indicate raters’ application of rating scales. This study applied natural language processing (NLP) techniques to quantify meaningful, behavioral information from a corpus of rater comments and triangulated that information with a many-facet Rasch measurement (MFRM) analysis of rater scores. The data consisted of ratings on 987 essays by 36 raters (a total of 3948 analytic scores and 1974 rater comments) on a post-admission English Placement Test (EPT) at a large US university. We computed a set of comment-based features based on the analytic components and evaluative language the raters used to infer whether raters were aligned to the scale. For data triangulation, we performed correlation analyses between the MFRM measures of rater performance and the comment-based measures. Although the EPT raters showed overall satisfactory performance, we found meaningful associations between rater comments and performance features. In particular, raters with higher precision and fit to what the Rasch model predicts used more analytic components and used evaluative language more similar to the scale descriptors. These findings suggest that NLP techniques have the potential to help language testers analyze rater comments and understand rater behavior.


Key words Many-facet Rasch measurement, natural language processing, rater comments, rater performance, rating scale
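The triangulation idea above, turning rater comments into numeric features and correlating them with Rasch-based rater measures, can be sketched in a few lines. The example below uses TF-IDF cosine similarity between comments and a scale descriptor as one possible comment feature; the texts, infit values, and the feature itself are placeholders rather than the study's actual pipeline.

```python
# Sketch: quantify rater comments, then correlate a comment-based feature
# with a rater-level measure (e.g., Rasch infit). All data are invented.
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

scale_descriptor = ("clear thesis, well organized paragraphs, "
                    "accurate grammar, varied vocabulary")
rater_comments = [
    "thesis is clear and paragraphs are well organized",
    "grammar errors but vocabulary is varied",
    "hard to follow, no clear organization",
    "accurate grammar and a clear thesis statement",
]
# Hypothetical infit mean-square values from an MFRM analysis,
# one per rater above (values near 1.0 indicate good model fit).
rater_infit = np.array([0.9, 1.1, 1.6, 0.8])

vec = TfidfVectorizer()
tfidf = vec.fit_transform([scale_descriptor] + rater_comments)
similarity = cosine_similarity(tfidf[0], tfidf[1:]).ravel()  # comment vs. descriptor

r, p = pearsonr(similarity, rater_infit)
print("similarity to descriptors:", np.round(similarity, 2))
print(f"correlation with infit: r = {r:.2f}")
```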



The development of a Chinese vocabulary proficiency test (CVPT) for learners of Chinese as a second/foreign language


Haiwei Zhang, Peking University, China.

Peng Sun, Minzu University of China, China.


Abstract In order to address the needs of the continually growing number of Chinese language learners, the present study developed and presented initial validation of a 100-item Chinese vocabulary proficiency test (CVPT) for learners of Chinese as a second/foreign language (CS/FL) using Item Response Theory among 170 CS/FL learners from Indonesia and 354 CS/FL learners from Thailand. Participants were required to translate or explain the meanings of the Chinese words using Indonesian or Thai. The results provided preliminary evidence for the construct validity of the CVPT for measuring CS/FL learners’ receptive Chinese vocabulary knowledge in terms of content, substantive, structural, generalizability, and external aspects. The translation-based CVPT was an attempt to measure CS/FL learners’ vocabulary proficiency by exploring their performance in a vocabulary translation task, potentially revealing test-takers’ high-degree vocabulary knowledge. Such a CVPT could be useful for Chinese vocabulary instruction and designing future Chinese vocabulary measurement tools.


Key words Chinese as a foreign language, Chinese as a second language, Chinese vocabulary proficiency test, vocabulary acquisition, vocabulary breadth test



Making each point count: Revising a local adaptation of the Jacobs et al.’s (1981) ESL COMPOSITION PROFILE rubric

Yu-Tzu Chang, University of Hawai‘i at Mānoa, USA.

Ann Tai Choe, University of Hawai‘i at Mānoa, USA; Hawai‘i Pacific University, USA.


Abstract In this Brief Report, we describe an evaluation of and revisions to a rubric adapted from the Jacobs et al.’s (1981) ESL COMPOSITION PROFILE, with four rubric categories and 20-point rating scales, in the context of an intensive English program writing placement test. Analysis of 4 years of rating data (2016–2021, including 434 essays) using many-facet Rasch measurement demonstrated that the 20-point rating scales of the Jacobs et al. rubric functioned poorly due to (a) questionably small distinctions in writing quality between successive score categories and (b) the presence of several disordered categories. We reanalyzed the score data after collapsing the 20-point scales into 4-point scales to simulate a revision to the rubric. This reanalysis appeared promising, with well-ordered and distinct score categories, and only a trivial decrease in person separation reliability. After implementing this revision to the rubric, we examined data from recent administrations (2022–2023, including 93 essays) to evaluate scale functioning. As in the simulation, scale categories were well-ordered and distinct in operational rating. Moreover, no raters demonstrated exceedingly poor fit using the revised rubric. Findings hold implications for other programs adopting/adapting the PROFILE or a similar rubric.


Key words Analytic rubric, many-facet Rasch measurement, rubric revision, scale functioning, writing
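The simulation step described above, collapsing the 20-point scales into 4-point scales and re-examining category use, amounts to a simple recoding prior to the Rasch reanalysis. The sketch below shows one plausible recoding on invented scores; the 5-point-wide bins are an assumption, since the actual category boundaries are not given here.

```python
# Collapse invented 1-20 analytic scores into 4 ordered categories and
# inspect category frequencies; the 5-point-wide bins are an assumption.
import numpy as np

rng = np.random.default_rng(4)
scores_20 = rng.integers(1, 21, size=434)   # invented 20-point ratings

# 1-5 -> category 1, 6-10 -> 2, 11-15 -> 3, 16-20 -> 4
scores_4 = (scores_20 - 1) // 5 + 1

values, counts = np.unique(scores_4, return_counts=True)
for v, c in zip(values, counts):
    print(f"category {v}: {c} ratings")
```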




About the Journal

Language Testing is an international peer reviewed journal that publishes original research on foreign, second, additional, and bi-/multi-/trans-lingual (henceforth collectively called L2) language testing, assessment, and evaluation. Since 1984 it has featured high impact L2 testing papers covering theoretical issues, empirical studies, and reviews. The journal's scope encompasses the testing, assessment, and evaluation of spoken and signed languages being learned as L2s by children and adults, and the use of tests as research and evaluation tools that are used to provide information on the language knowledge and language performance abilities of L2 learners. Many articles also contribute to methodological innovation and the practical improvement of L2 testing internationally. In addition, the journal publishes submissions that deal with L2 testing policy issues, including the use of tests for making high-stakes decisions about L2 learners in fields as diverse as education, employment, and international mobility.



The journal welcomes the submission of papers that deal with ethical and philosophical issues in L2 testing, as well as issues centering on L2 test design, validation, and technical matters. Also of concern is research into the washback and impact of L2 language test use, the consequences of testing on L2 learner groups, and ground-breaking uses of assessments for L2 learning. Additionally, the journal wishes to publish replication studies that help to embed and extend knowledge of generalisable findings in the field. Language Testing is committed to encouraging interdisciplinary research, and is keen to receive submissions which draw on current theory and methodology from different areas within second language acquisition, applied linguistics, educational measurement, psycholinguistics, general education, psychology, cognitive science, language policy, and other relevant subdisciplines that interface with language testing and assessment. Authors are encouraged to adhere to Open Science Initiatives.



Official website:

https://journals.sagepub.com/home/LTJ

Source: the official website of Language Testing








Today's editor: 氢氧根

Reviewer: 心得小蔓

