diff --git a/data/xml/2025.aimecon.xml b/data/xml/2025.aimecon.xml new file mode 100644 index 0000000000..700684a66b --- /dev/null +++ b/data/xml/2025.aimecon.xml @@ -0,0 +1,981 @@ + + + + + Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers + JoshuaWilson + ChristopherOrmerod + MagdalenBeiting Parrish + National Council on Measurement in Education (NCME) +
Wyndham Grand Pittsburgh Downtown, Pittsburgh, Pennsylvania, United States
+ October + 2025 + 2025.aimecon-main + aimecon + 979-8-218-84228-4 + + + 2025.aimecon-main.0 + aime-con-2025-main + + + Input Optimization for Automated Scoring in Reading Assessment + Ji YoonJungBoston College + UmmugulBezirhanBoston College + Matthiasvon DavierBoston College + 1-8 + This study examines input optimization for enhanced efficiency in automated scoring (AS) of reading assessments, which typically involve lengthy passages and complex scoring guides. We propose optimizing input size using question-specific summaries and simplified scoring guides. Findings indicate that input optimization via compression is achievable while maintaining AS performance. + 2025.aimecon-main.1 + jung-etal-2025-input + + + Implementation Considerations for Automated <fixed-case>AI</fixed-case> Grading of Student Work + ZeweiTian + AlexLiuUniversity of Washington + LiefEsbenshadeUniversity of Washington + ShawonSarkarUniversity of Washington + ZacharyZhangHensun Innovation + KevinHeHensun Innovation + MinSunUniversity of Washington + 9-20 + Nineteen K-12 teachers participated in a co-design pilot study of an AI education platform, testing assessment grading. Teachers valued AI’s rapid narrative feedback for formative assessment but distrusted automated scoring, preferring human oversight. Students appreciated immediate feedback but remained skeptical of AI-only grading, highlighting needs for trustworthy, teacher-centered AI tools. + 2025.aimecon-main.2 + tian-etal-2025-implementation + + + Compare Several Supervised Machine Learning Methods in Detecting Aberrant Response Pattern + YiLuFederation of State Boards of Physical Therapy + YuZhangThe Federation of State Boards of Physical Therapy + LorinMuellerFederation of State Boards of Physical Therapy + 21-24 + Aberrant response patterns, e.g., a test taker answers difficult questions correctly but is unable to answer easy questions correctly, are first identified using lz and lz*. We then compared the performance of five supervised machine learning methods in detecting aberrant response patterns identified by lz or lz*. + 2025.aimecon-main.3 + lu-etal-2025-compare + + + Leveraging multi-<fixed-case>AI</fixed-case> agents for a teacher co-design + HongwenGuoETS Research Institute + Matthew S.JohnsonETS Research Institute + LuisSaldiviaETS + MichelleWorthingtonETS + KadriyeErcikanETS + 25-34 + This study uses multi-AI agents to accelerate teacher co-design efforts. It innovatively links student profiles obtained from numerical assessment data to AI agents in natural languages. The AI agents simulate human inquiry, enrich feedback and ground it in teachers’ knowledge and practice, showing significant potential for transforming assessment practice and research. + 2025.aimecon-main.4 + guo-etal-2025-leveraging + + + Long context Automated Essay Scoring with Language Models + ChristopherOrmerodCambium Assessment + GititKehatCambium Assessment + 35-42 + In this study, we evaluate several models that incorporate architectural modifications to overcome the length limitations of the standard transformer architecture using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama.
+ 2025.aimecon-main.5 + ormerod-kehat-2025-long + + + Optimizing Reliability Scoring for <fixed-case>ILSA</fixed-case>s + Ji YoonJungBoston College + UmmugulBezirhanBoston College + Matthiasvon DavierBoston College + 43-49 + This study proposes an innovative method for evaluating cross-country scoring reliability (CCSR) in multilingual assessments, using hyperparameter optimization and a similarity-based weighted majority scoring within a single human scoring framework. Results show that this approach provides a cost-effective and comprehensive assessment of CCSR without the need for additional raters. + 2025.aimecon-main.6 + jung-etal-2025-optimizing + + + Exploring <fixed-case>AI</fixed-case>-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment + JillBursteinDuolingo + RamseyCardwellDuolingo + Ping-LinChuangDuolingo + AllisonMichalowskiDuolingo + StevenNydickDuolingo + 50-57 + We analyzed data from 25,969 test takers of a high-stakes, computer-adaptive English proficiency test to examine relationships between repeated use of AI-generated practice tests and performance, affect, and score-sharing behavior. Taking 1–3 practice tests was associated with higher scores and confidence, while higher usage showed different engagement and outcome + 2025.aimecon-main.7 + burstein-etal-2025-exploring + + + Develop a Generic Essay Scorer for Practice Writing Tests of Statewide Assessments + YiGuiThe University of Iowa + 58-81 + This study examines whether NLP transfer learning techniques, specifically BERT, can be used to develop prompt-generic AES models for practice writing tests. Findings reveal that fine-tuned DistilBERT, without further pre-training, achieves high agreement (QWK ≈ 0.89), enabling scalable, robust AES models in statewide K-12 assessments without costly supplementary pre-training. + 2025.aimecon-main.8 + gui-2025-develop + + + Towards assessing persistence in reading in young learners using pedagogical agents + CaitlinTenisonETS + BeataBeigman KelbanovETS + NoahSchroederUniversity of Florida + ShanZhangUniversity of Florida + MichaelSuhanEducational Testing Service + ChuyangZhangUniversity of Florida + 82-90 + This pilot study investigated the use of a pedagogical agent to administer a conversational survey to second graders following a digital reading activity, measuring comprehension, persistence, and enjoyment. Analysis of survey responses and behavioral log data provide evidence for recommendations for the design of agent-mediated assessment in early literacy. + 2025.aimecon-main.9 + tenison-etal-2025-towards + + + <fixed-case>LLM</fixed-case>-Based Approaches for Detecting Gaming the System in Self-Explanation + Jiayi (Joyce)ZhangUniversity of Pennsylvania + Ryan S.BakerUniversity of Pennsylvania + Bruce M.McLarenCarnegie Mellon University + 91-98 + This study compares two LLM-based approaches for detecting gaming behavior in students’ open-ended responses within a math digital learning game. The sentence embedding method outperformed the prompt-based approach and was more conservative. Consistent with prior research, gaming correlated negatively with learning, highlighting LLMs’ potential to detect disengagement in open-ended tasks. 
+ 2025.aimecon-main.10 + zhang-etal-2025-llm-based + + + Evaluating the Impact of <fixed-case>LLM</fixed-case>-guided Reflection on Learning Outcomes with Interactive <fixed-case>AI</fixed-case>-Generated Educational Podcasts + VishnuMenonDrexel University + AndyCherneyDrexel University + Elizabeth B.CloudeMichigan State University + LiZhangDrexel University + Tiffany DiemDoDrexel University + 99-106 + This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated, and while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting a call for more research on reflective interactivity design. + 2025.aimecon-main.11 + menon-etal-2025-evaluating + + + Generative <fixed-case>AI</fixed-case> in the K–12 Formative Assessment Process: Enhancing Feedback in the Classroom + Mike ThomasMaksimchukKent Intermediate School District + EdwardRoeberMichigan Assessment Consortium + DavieStoreKent Intermediate School District + 107-110 + This paper explores how generative AI can enhance K–12 formative assessment by improving feedback, supporting task design, fostering student metacognition, and building teacher assessment literacy. It addresses challenges of equity, ethics, and implementation, offering practical strategies and case studies to guide responsible AI integration in classroom formative assessment practices. + 2025.aimecon-main.12 + maksimchuk-etal-2025-generative + + + Using Large Language Models to Analyze Students’ Collaborative Argumentation in Classroom Discussions + NhatTranUniversity of Pittsburgh + DianeLitman + AmandaGodleyUniversity of Pittsburgh + 111-125 + Collaborative argumentation enables students to build disciplinary knowledge and to think in disciplinary ways. We use Large Language Models (LLMs) to improve existing methods for collaboration classification and argument identification. Results suggest that LLMs are effective for both tasks and should be considered as a strong baseline for future research. + 2025.aimecon-main.13 + tran-etal-2025-using + + + Evaluating Generative <fixed-case>AI</fixed-case> as a Mentor Resource: Bias and Implementation Challenges + JiminLeeClark University + Alena GEspositoClark University + 126-133 + We explored how students’ perceptions of helpfulness and caring skew their ability to identify AI versus human mentorship responses. Emotionally resonant responses often lead to misattributions, indicating perceptual biases that shape mentorship judgments. The findings inform ethical, relational, and effective integration of AI in student support. + 2025.aimecon-main.14 + lee-esposito-2025-evaluating + + + <fixed-case>AI</fixed-case>-Based Classification of <fixed-case>TIMSS</fixed-case> Items for Framework Alignment + UmmugulBezirhanBoston College + Matthiasvon DavierBoston College + 134-141 + Large-scale assessments rely on expert panels to verify that test items align with prescribed frameworks, a labor-intensive process. This study evaluates the use of GPT-4o to classify TIMSS items to content domain, cognitive domain, and difficulty categories. Findings highlight the potential of language models to support scalable, framework-aligned item verification. 
+ 2025.aimecon-main.15 + bezirhan-von-davier-2025-ai + + + Towards Reliable Generation of Clinical Chart Items: A Counterfactual Reasoning Approach with Large Language Models + JiaxuanLiUniversity of California Irvine + SaedRezayiNBME + PeterBaldwinNational Board of Medical Examiners + PolinaHarikNBME + VictoriaYanevaNational Board of Medical Examiners + 142-153 + This study explores GPT-4 for generating clinical chart items in medical education using three prompting strategies. Expert evaluations found many items usable or promising. The counterfactual approach enhanced novelty, and item quality improved with high-surprisal examples. This is the first investigation of LLMs for automated clinical chart item generation. + 2025.aimecon-main.16 + li-etal-2025-towards-reliable + + + Using Whisper Embeddings for Audio-Only Latent Token Classification of Classroom Management Practices + Wesley GriffithMorris + JessicaVitaleVanderbilt University + IsabelArveloVanderbilt University + 154-162 + In this study, we developed a textless NLP system using a fine-tuned Whisper encoder to identify classroom management practices from noisy classroom recordings. The model segments teacher speech from non-teacher speech and performs multi-label classification of classroom practices, achieving acceptable accuracy without requiring transcript generation. + 2025.aimecon-main.17 + morris-etal-2025-using + + + Comparative Study of Double Scoring Design for Measuring Mathematical Quality of Instruction + Jonathan KyleFosterUniversity at Albany + JamesDrimallaUniversity of Virginia + NursultanJapashovUniversity at Albany + 163-171 + The integration of automated scoring, and whether it might meet the extensive need for double scoring in classroom observation systems, is the focus of this study. We outline an accessible approach for determining the interchangeability of automated systems within comparative scoring design studies. + 2025.aimecon-main.18 + foster-etal-2025-comparative + + + Toward Automated Evaluation of <fixed-case>AI</fixed-case>-Generated Item Drafts in Clinical Assessment + TazinAfrinNBME + Le AnHaHo Chi Minh City University of Foreign Languages and Information Technology + VictoriaYanevaNational Board of Medical Examiners + KeelanEvaniniNBME + StevenGoNBME + KristineDeRuchieNBME + MichaelHeiligNBME + 172-182 + This study examines the classification of AI-generated clinical multiple-choice question drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated—including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations for evaluation methods of AI-generated items in clinical test development. + 2025.aimecon-main.19 + afrin-etal-2025-toward + + + Numeric Information in Elementary School Texts Generated by <fixed-case>LLM</fixed-case>s vs Human Experts + AnastasiaSmirnovaSan Francisco State University + Erin S.LeeUniversity of California, Berkeley + ShiyingLiSan Francisco State University + 183-191 + We analyze GPT-4o’s ability to represent numeric information in texts for elementary school children and assess it with respect to the human baseline. We show that both humans and GPT-4o reduce the amount of numeric information when adapting informational texts for children, but GPT-4o retains more complex numeric types than humans do.
+ 2025.aimecon-main.20 + smirnova-etal-2025-numeric + + + Towards evaluating teacher discourse without task-specific fine-tuning data + BeataBeigman Klebanov + MichaelSuhanEducational Testing Service + Jamie N.MikeskaETS + 192-200 + Teaching simulations with feedback are one way to provide teachers with practice opportunities to help improve their skills. We investigated methods to build evaluation models of teacher performance in leading a discussion in a simulated classroom, particularly for tasks with little performance data. + 2025.aimecon-main.21 + beigman-klebanov-etal-2025-towards + + + Linguistic proficiency of humans and <fixed-case>LLM</fixed-case>s in <fixed-case>J</fixed-case>apanese: Effects of task demands and content + May LynnReeseSan Francisco State University + AnastasiaSmirnovaSan Francisco State University + 201-211 + We evaluate linguistic proficiency of humans and LLMs on pronoun resolution in Japanese, using the Winograd Schema Challenge dataset. Humans outperform LLMs in the baseline condition, but we find evidence for task demand effects in both humans and LLMs. We also find that LLMs surpass human performance in scenarios referencing US culture, providing strong evidence for content effects. + 2025.aimecon-main.22 + reese-smirnova-2025-linguistic + + + Generative <fixed-case>AI</fixed-case> Teaching Simulations as Formative Assessment Tools within Preservice Teacher Preparation + Jamie N.MikeskaETS + AakankshaBhatiaExcelOne + ShreyashiHalderETS + TriciaMaxwellETS + BeataBeigman Klebanov + BennyLongwillETS + KashishBehlETS + CalliShekellThiel University + 212-220 + This paper examines how generative AI (GenAI) teaching simulations can be used as a formative assessment tool to gain insight into elementary preservice teachers’ (PSTs’) instructional abilities. This study investigated the teaching moves PSTs used to elicit student thinking in a GenAI simulation and their perceptions of the simulation’s + 2025.aimecon-main.23 + mikeska-etal-2025-generative + + + Using <fixed-case>LLM</fixed-case>s to identify features of personal and professional skills in an open-response situational judgment test + ColeWalshAcuity Insights + RodicaIvanAcuity Insights + Muhammad ZafarIqbalAcuity Insights + ColleenRobbAcuity Insights + 221-230 + Current methods for assessing personal and professional skills lack scalability due to reliance on human raters, while NLP-based systems for assessing these skills fail to demonstrate construct validity. This study introduces a new method utilizing LLMs to extract construct-relevant features from responses to an assessment of personal and professional skills. + 2025.aimecon-main.24 + walsh-etal-2025-using + + + Automated Evaluation of Standardized Patients with <fixed-case>LLM</fixed-case>s + AndrewEmersonNational Board of Medical Examiners + Le AnHaHo Chi Minh City University of Foreign Languages and Information Technology + KeelanEvaniniNBME + SuSomayNational Board of Medical Examiners + KevinFromeNational Board of Medical Examiners + PolinaHarikNBME + VictoriaYanevaNational Board of Medical Examiners + 231-238 + Standardized patients (SPs) are essential for clinical reasoning assessments in medical education. This paper introduces evaluation metrics that apply to both human and simulated SP systems. The metrics are computed using two LLM-as-a-judge approaches that align with human evaluators on SP performance, enabling scalable formative clinical reasoning assessments.
+ 2025.aimecon-main.25 + emerson-etal-2025-automated + + + <fixed-case>LLM</fixed-case>-Human Alignment in Evaluating Teacher Questioning Practices: Beyond Ratings to Explanation + RuikunHouTechnical University of Munich + TimFüttererUniversity of Tübingen + BabetteBühlerTechnical University of Munich + PatrickSchreyerUniversity of Kassel + PeterGerjetsLeibniz-Institut für Wissensmedien + UlrichTrautweinUniversity of Tübingen + EnkelejdaKasneciTechnical University of Munich + 239-249 + This study investigates the alignment between large language models (LLMs) and human raters in assessing teacher questioning practices, moving beyond rating agreement to the evidence selected to justify their decisions. Findings highlight LLMs’ potential to support large-scale classroom observation through interpretable, evidence-based scoring, with possible implications for concrete teacher feedback. + 2025.aimecon-main.26 + hou-etal-2025-llm + + + Leveraging Fine-tuned Large Language Models in Item Parameter Prediction + SuhwaHanCambium Assessment + FrankRijmenCambium Assessment + Allison AmesBoykinNBME + SusanLottridgeCambium Assessment + 250-264 + The study introduces novel approaches for fine-tuning pre-trained LLMs to predict item response theory parameters directly from item texts and structured item attribute variables. The proposed methods were evaluated on a dataset of over 1,000 English Language Arts items that are currently in the operational pool for a large-scale assessment. + 2025.aimecon-main.27 + han-etal-2025-leveraging-fine + + + How Model Size, Temperature, and Prompt Style Affect <fixed-case>LLM</fixed-case>-Human Assessment Score Alignment + JulieJungHarvard University + MaxLuHarvard University + Sina CholeBenkerUniversity of Münster + DogusDariciUniversity of Münster + 265-273 + We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment with human raters in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Findings reveal both the potential for scalable LLM-raters and the risks of relying on them exclusively. + 2025.aimecon-main.28 + jung-etal-2025-model + + + Assessing <fixed-case>AI</fixed-case> skills: A washback point of view + MeiravArieli-AttaliFordham University + BeataBeigman Klebanov + TenahaO’Reilly + DiegoZapata-RiveraETS + TamiSabag-ShushanNational Authority for Testing and Evaluation, Israel + ImanAwadieNational Authority for Testing and Evaluation, Israel + 274-280 + The emerging dominance of AI in the perception of skills-of-the-future makes assessing AI skills necessary to help guide learning. Creating an assessment of AI skills poses some new challenges. We examine those from the point of view of washback, and exemplify using two exploration studies conducted with 9th grade students. + 2025.aimecon-main.29 + arieli-attali-etal-2025-assessing + + + Using Generative <fixed-case>AI</fixed-case> to Develop a Common Metric in Item Response Theory + PeterBaldwinNational Board of Medical Examiners + 281-289 + We propose a method for linking independently calibrated item response theory (IRT) scales using large language models to generate shared parameter estimates across forms. Applied to medical licensure data, the approach reliably recovers slope values across all conditions and yields accurate intercepts when cross-form differences in item difficulty are small.
+ 2025.aimecon-main.30 + baldwin-2025-using + + + Augmented Measurement Framework for Dynamic Validity and Reciprocal Human-<fixed-case>AI</fixed-case> Collaboration in Assessment + TaiwoFeyijimiUniversity of Georgia + Daniel OOyeniranThe University of Alabama + OukayodeApataTexas A&M University + Henry SanmiMakinde + Hope OluwaseunAdegokeUniversity of North Carolina, Greensboro + JohnAjamobeTexas A&M University + JusticeDadzieThe University of Alabama + 290-296 + The proliferation of Generative Artificial Intelligence presents unprecedented opportunities and profound challenges for educational measurement. This study introduces the Augmented Measurement Framework grounded in four core principles. The paper discusses practical applications, implications for professional development and policy, and charts a research agenda for advancing this framework in educational measurement. + 2025.aimecon-main.31 + feyijimi-etal-2025-augmented + + + Patterns of Inquiry, Scaffolding, and Interaction Profiles in Learner-<fixed-case>AI</fixed-case> Collaborative Math Problem-Solving + ZilongPanLehigh University + ShenBaThe Education University of Hong Kong + ZiluJiangJohns Hopkins University + ChengluLiUniversity of Utah + 297-305 + This study investigates inquiry and scaffolding patterns between students and MathPal, a math AI agent, during problem-solving tasks. Using qualitative coding, lag sequential analysis, and Epistemic Network Analysis, the study identifies distinct interaction profiles, revealing how personalized AI feedback shapes student learning behaviors and inquiry dynamics in mathematics problem-solving activities. + 2025.aimecon-main.32 + pan-etal-2025-patterns + + + Pre-trained Transformer Models for Standard-to-Standard Alignment Study + Hye-JeongChoiHumRRO + ReeseButterfussCentriverse + MengFanHumRRO + 306-311 + The current study evaluated the accuracy of five pre-trained large language models (LLMs) in matching human judgment for a standard-to-standard alignment study. Results demonstrated comparable performance across LLMs despite differences in scale and computational demands. Additionally, incorporating domain labels as auxiliary information did not enhance LLM performance. These findings provide initial evidence for the viability of open-source LLMs to facilitate alignment studies and offer insights into the utility of auxiliary information. + 2025.aimecon-main.33 + choi-etal-2025-pre + + + From Entropy to Generalizability: Strengthening Automated Essay Scoring Reliability and Sustainability + YiGuiThe University of Iowa + 312-328 + Generalizability Theory with entropy-derived stratification optimized automated essay scoring reliability. A G-study decomposed variance across 14 encoders and 3 seeds; D-studies identified minimal ensembles achieving G ≥ 0.85. A hybrid of one medium and one small encoder with two seeds maximized dependability per compute cost. Stratification ensured uniform precision across + 2025.aimecon-main.34 + gui-2025-entropy + + + Undergraduate Students’ Appraisals and Rationales of <fixed-case>AI</fixed-case> Fairness in Higher Education + VictoriaDelaneySan Diego State University + SundaySteinSan Diego State University and University of California, San Diego + LilySawiSan Diego State University + KatyaHernandez HollidaySan Diego State University and University of California, San Diego + 329-336 + To measure learning with AI, students must be afforded opportunities to use AI consistently across courses.
Our interview study of 36 undergraduates revealed that students make independent appraisals of AI fairness amid school policies and use AI inconsistently on school assignments. We discuss tensions for measurement raised from students’ responses. + 2025.aimecon-main.35 + delaney-etal-2025-undergraduate + + + <fixed-case>AI</fixed-case>-Generated Formative Practice and Feedback: Performance Benchmarks and Applications in Higher Education + Rachelvan CampenhoutVitalSource + Michelle WeaverClarkVitalSource + Jeffrey S.DittelVitalSource + BillJeromeVitalSource + NickBrownVitalSource + BennyJohnsonVitalSource Technologies + 337-344 + Millions of AI-generated formative practice questions across thousands of publisher etextbooks are available for student use in higher education. We review the research to address both performance metrics for questions and feedback calculated from student data, and discuss the importance of successful applications in the classroom to maximize learning potential. + 2025.aimecon-main.36 + van-campenhout-etal-2025-ai + + + Beyond Agreement: Rethinking Ground Truth in Educational <fixed-case>AI</fixed-case> Annotation + Danielle RThomasCarnegie Mellon University + ConradBorchersCarnegie Mellon University + KenKoedingerCarnegie Mellon University + 345-351 + Humans are biased, inconsistent, and yet we keep trusting them to define “ground truth.” This paper questions the overreliance on inter-rater reliability in educational AI and proposes a multidimensional approach leveraging expert-based approaches and close-the-loop validity to build annotations that reflect impact, not just agreement. It’s time we do better. + 2025.aimecon-main.37 + thomas-etal-2025-beyond + + + Automated search algorithm for optimal generalized linear mixed models (<fixed-case>GLMM</fixed-case>s) + MiryeongKooUniversity of Illinois at Urbana-Champaign + JinmingZhangUniversity of Illinois at Urbana-Champaign + 352-358 + Only a limited number of predictors can be included in a generalized linear mixed model (GLMM) due to estimation algorithm divergence. This study aims to propose a machine learning based algorithm (e.g., random forest) that can consider all predictors without the convergence issue and automatically searches for the optimal GLMMs. + 2025.aimecon-main.38 + koo-zhang-2025-automated + + + Exploring the Psychometric Validity of <fixed-case>AI</fixed-case>-Generated Student Responses: A Study on Virtual Personas’ Learning Motivation + HuanxiaoWang + 359-366 + This study explores whether large language models (LLMs) can simulate valid student responses for educational measurement. Using GPT-4o, 2000 virtual student personas were generated. Each persona completed the Academic Motivation Scale (AMS). Factor analyses(EFA and CFA) and clustering showed GPT-4o reproduced the AMS structure and distinct motivational subgroups. + 2025.aimecon-main.39 + wang-2025-exploring + + + Measuring Teaching with <fixed-case>LLM</fixed-case>s + MichaelHardyStanford University + 367-384 + This paper introduces custom Large Language Models using sentence-level embeddings to measure teaching quality. The models achieve human-level performance in analyzing classroom transcripts, outperforming average human rater correlation. Aggregate model scores align with student learning outcomes, establishing a powerful new methodology for scalable teacher feedback. Important limitations discussed. 
+ 2025.aimecon-main.40 + hardy-2025-measuring + + + Simulating Rating Scale Responses with <fixed-case>LLM</fixed-case>s for Early-Stage Item Evaluation + OnurDemirkayaRiverside Insights + Hsin-RoWeiRiverside Insights + EvelynJohnsonRiverside Insights + 385-392 + This study explores the use of large language models to simulate human responses to Likert-scale items. A DeBERTa-base model fine-tuned with item text and examinee ability emulates a graded response model (GRM). High alignment with GRM probabilities and reasonable threshold recovery support LLMs as scalable tools for early-stage item evaluation. + 2025.aimecon-main.41 + demirkaya-etal-2025-simulating + + + Bias and Reliability in <fixed-case>AI</fixed-case> Safety Assessment: Multi-Facet Rasch Analysis of Human Moderators + ChunlingNiuThe University of the Incarnate Word + KellyBradleyUniversity of Kentucky + BiaoMaThe University of the Incarnate Word + BrianWaltmanThe University of the Incarnate Word + LorenCossetteThe University of the Incarnate Word + RuiJinShenzhen University + 393-397 + Using Multi-Facet Rasch Modeling on 36,400 safety ratings of AI-generated conversations, we reveal significant racial disparities (Asian 39.1%, White 28.7% detection rates) and content-specific bias patterns. Simulations show that diverse teams of 8-10 members achieve 70%+ reliability versus 62% for smaller homogeneous teams, providing evidence-based guidelines for AI-generated content moderation. + 2025.aimecon-main.42 + niu-etal-2025-bias + + + Dynamic <fixed-case>B</fixed-case>ayesian Item Response Model with Decomposition (<fixed-case>D</fixed-case>-<fixed-case>BIRD</fixed-case>): Modeling Cohort and Individual Learning Over Time + HansolLeeStanford University + Jason B.ChoCornell University + David S.MattesonCornell University + BenjaminDomingueStanford University + 398-405 + We present D-BIRD, a Bayesian dynamic item response model for estimating student ability from sparse, longitudinal assessments. By decomposing ability into a cohort trend and individual trajectory, D-BIRD supports interpretable modeling of learning over time. We evaluate parameter recovery in simulation and demonstrate the model using real-world personalized learning data. + 2025.aimecon-main.43 + lee-etal-2025-dynamic-bayesian + + + Enhancing Essay Scoring with <fixed-case>GPT</fixed-case>-2 Using Back Translation Techniques + AysegulGunduzUniversity of Alberta + MarkGierlUniversity of Alberta + OkanBulutUniversity of Alberta + 406-416 + This study evaluates GPT-2 (small) for automated essay scoring on the ASAP dataset. Back-translation (English–Turkish–English) improved performance, especially on imbalanced sets. QWK scores peaked at 0.77. Findings highlight augmentation’s value and the need for more advanced, rubric-aware models for fairer assessment. + 2025.aimecon-main.44 + gunduz-etal-2025-enhancing + + + Mathematical Computation and Reasoning Errors by Large Language Models + LiangZhangUniversity of Georgia + EdithGrafETS + 417-424 + We evaluate four LLMs (GPT-4o, o1, DeepSeek-V3, DeepSeek-R1) on purposely challenging arithmetic, algebra, and number-theory items. Coding final answers and step-level solutions correctness reveals performance gaps, improvement paths, and how accurate LLMs can strengthen mathematics assessment and instruction. + 2025.aimecon-main.45 + zhang-graf-2025-mathematical + +
+ + + Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress + JoshuaWilson + ChristopherOrmerod + MagdalenBeiting Parrish + National Council on Measurement in Education (NCME) +
Wyndham Grand Pittsburgh Downtown, Pittsburgh, Pennsylvania, United States
+ October + 2025 + 2025.aimecon-wip + aimecon + 979-8-218-84229-1 + + + 2025.aimecon-wip.0 + aime-con-2025-wip + + + Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias + SiruiWuUniversity of British Columbia + DaijinYangNortheastern University + 1-13 + This study explores an AI-assisted approach for rewriting personality scale items to reduce social desirability bias. Using GPT-refined neutralized items based on the IPIP-BFM-50, we compare factor structures, item popularity, and correlations with the MC-SDS to evaluate construct validity and the effectiveness of AI-based item refinement in Chinese contexts. + 2025.aimecon-wip.1 + wu-yang-2025-automated + + + <fixed-case>AI</fixed-case> as a Mind Partner: Cognitive Impact in <fixed-case>P</fixed-case>akistan’s Educational Landscape + EmanKhalid + HammadJavaidLahore University of Management Sciences, LUMS + YashalWaseemLahore University of Management Sciences + Natasha SohailBarlasLahore University of Management Sciences + 14-19 + This study explores how high school and university students in Pakistan perceive and use generative AI as a cognitive extension. Drawing on the Extended Mind Theory, impact on critical thinking, and ethics are evaluated. Findings reveal over-reliance, mixed emotional responses, and institutional uncertainty about AI’s role in learning. + 2025.aimecon-wip.2 + khalid-etal-2025-ai + + + Detecting Math Misconceptions: An <fixed-case>AI</fixed-case> Benchmark Dataset + BethanyRittle-JohnsonVanderbilt University + RebeccaAdlerVanderbilt University + KelleyDurkinVanderbilt University + LBurleighThe Learning Agency + JulesKingThe Learning Agency + ScottCrossleyVanderbilt University + 20-24 + To harness the promise of AI for improving math education, AI models need to be able to diagnose math misconceptions. We created an AI benchmark dataset on math misconceptions and other instructionally-relevant errors, comprising over 52,000 explanations written over 15 math questions that were scored by expert human raters. + 2025.aimecon-wip.3 + rittle-johnson-etal-2025-detecting + + + Optimizing Opportunity: An <fixed-case>AI</fixed-case>-Driven Approach to Redistricting for Fairer School Funding + JordanAbbottNew America, Education Funding Equity Initiative + 25-33 + We address national educational inequity driven by school district boundaries using a comparative AI framework. Our models, which redraw boundaries from scratch or consolidate existing districts, generate evidence-based plans that reduce funding and segregation disparities, offering policymakers scalable, data-driven solutions for systemic reform. + 2025.aimecon-wip.4 + abbott-2025-optimizing + + + Automatic Grading of Student Work Using Simulated Rubric-Based Data and <fixed-case>G</fixed-case>en<fixed-case>AI</fixed-case> Models + YiyaoYangTeachers College, Columbia University + YaseminGulbaharTeachers College, Columbia University + 34-39 + Grading assessment in data science faces challenges related to scalability, consistency, and fairness. Synthetic dataset and GenAI enable us to simulate realistic code samples and automatically evaluate using rubric-driven systems. The research proposes an automatic grading system for generated Python code samples and explores GenAI grading reliability through human-AI comparison. 
+ 2025.aimecon-wip.5 + yang-gulbahar-2025-automatic + + + Cognitive Engagement in <fixed-case>G</fixed-case>en<fixed-case>AI</fixed-case> Tutor Conversations: At-scale Measurement and Impact on Learning + KodiWeatherholtzKhan Academy + Kelli MillwoodHillKhan Academy + KristenDicerboKhan Academy + WaltWellsKhan Academy + PhillipGrimaldiKhan Academy + MayaMiller-VedamKhan Academy + CharlesHoggKhan Academy + BogdanYamkovenkoKhan Academy + 40-48 + We developed and validated a scalable LLM-based labeler for classifying student cognitive engagement in GenAI tutoring conversations. Higher engagement levels predicted improved next-item performance, though further research is needed to assess distal transfer and to disentangle effects of continued tutor use from true learning transfer. + 2025.aimecon-wip.6 + weatherholtz-etal-2025-cognitive + + + Chain-of-Thought Prompting for Automated Evaluation of Revision Patterns in Young Student Writing + TianwenLiUniversity of Pittsburgh + MichelleHongUniversity of Pittsburgh + Lindsay ClareMatsumuraUniversity of Pittsburgh + Elaine LinWangRand Corporation + DianeLitman + ZhexiongLiuUniversity of Pittsburgh + RichardCorrentiUniversity of Pittsburgh + 49-65 + This study explores the use of ChatGPT-4.1 as a formative assessment tool for identifying revision patterns in young adolescents’ argumentative writing. ChatGPT-4.1 shows moderate agreement with human coders on identifying evidence-related revision patterns and fair agreement on explanation-related ones. Implications for LLM-assisted formative assessment of young adolescent writing are discussed. + 2025.aimecon-wip.7 + li-etal-2025-chain-thought + + + Predicting and Evaluating Item Responses Using Machine Learning, Text Embeddings, and <fixed-case>LLM</fixed-case>s + EvelynJohnsonRiverside Insights + Hsin-RoWeiRiverside Insights + TongWu + HuanLiuRiverside Insights + 66-70 + This work-in-progress study compares the accuracy of machine learning and large language models to predict student responses to field-test items on a social-emotional learning assessment. We evaluate how well each method replicates actual responses and compare the item parameters generated from synthetic data to those derived from actual student data. + 2025.aimecon-wip.8 + johnson-etal-2025-predicting + + + Evaluating <fixed-case>LLM</fixed-case>-Based Automated Essay Scoring: Accuracy, Fairness, and Validity + YueHuangMeasurement Incorporated + JoshuaWilsonUniversity of Delaware + 71-83 + This study evaluates large language models (LLMs) for automated essay scoring (AES), comparing prompt strategies and fairness across student groups. We found that well-designed prompting helps LLMs approach traditional AES performance, but both differ from human scores for ELLs—the traditional model shows larger overall gaps, while LLMs show subtler disparities. + 2025.aimecon-wip.9 + huang-wilson-2025-evaluating + + + Comparing <fixed-case>AI</fixed-case> tools and Human Raters in Predicting Reading Item Difficulty + HongliLiGeorgia State University + RoulaAldibGeorgia State University + ChadMarchongGeorgia State University + KevinFanFulton County Schools + 84-89 + This study compares AI tools and human raters in predicting the difficulty of reading comprehension items without response data. Predictions from AI models (ChatGPT, Gemini, Claude, and DeepSeek) and human raters are evaluated against empirical difficulty values derived from student responses. Findings will inform AI’s potential to support test development.
+ 2025.aimecon-wip.10 + li-etal-2025-comparing + + + When Machines Mislead: Human Review of Erroneous <fixed-case>AI</fixed-case> Cheating Signals + WilliamBelzakDuolingo + ChenhaoNiuDuolingo, Inc. + AngelOrtmann LeeDuolingo, Inc. + 90-97 + This study examines how human proctors interpret AI-generated alerts for misconduct in remote assessments. Findings suggest proctors can identify false positives, though confirmation bias and differences across test-taker nationalities were observed. Results highlight opportunities to refine proctoring guidelines and strengthen fairness in human oversight of automated signals in high-stakes testing. + 2025.aimecon-wip.11 + belzak-etal-2025-machines + + + Fairness in Formative <fixed-case>AI</fixed-case>: Cognitive Complexity in Chatbot Questions Across Research Topics + Alexandra BarryColbertCollege Board + Karen DWangSchool of Information, San Jose State University + 98-106 + This study evaluates whether questions generated from a socratic-style research AI chatbot designed to support project-based AP courses maintains cognitive complexity parity when inputted with research topics of controversial and non-controversial nature. We present empirical findings indicating no significant conversational complexity differences, highlighting implications for equitable AI use in formative assessment. + 2025.aimecon-wip.12 + colbert-wang-2025-fairness + + + Keystroke Analysis in Digital Test Security: <fixed-case>AI</fixed-case> Approaches for Copy-Typing Detection and Cheating Ring Identification + ChenhaoNiuDuolingo, Inc. + Yong-SiangShihDuolingo, Inc. + ManqianLiaoDuolingo + RuidongLiuDuolingo, Inc. + AngelOrtmann LeeDuolingo, Inc. + 107-116 + This project leverages AI-based analysis of keystroke and mouse data to detect copy-typing and identify cheating rings in the Duolingo English Test. By modeling behavioral biometrics, the approach provides actionable signals to proctors, enhancing digital test security for large-scale online assessment. + 2025.aimecon-wip.13 + niu-etal-2025-keystroke + + + Talking to Learn: A <fixed-case>S</fixed-case>o<fixed-case>TL</fixed-case> Study of Generative <fixed-case>AI</fixed-case>-Facilitated Feynman Reviews + Madeline RoseMattoxUniversity of Virginia + NatalieHutchinsUniversity of Virginia + Jamie JJiroutUniversity of Virginia + 117-124 + Structured Generative AI interactions have potential for scaffolding learning. This Scholarship of Teaching and Learning study analyzes 16 undergraduate students’ Feynman-style AI interactions (N=157) across a semester-long child-development course. Qualitative coding of the interactions explores engagement patterns, metacognitive support, and response consistency, informing ethical AI integration in higher education. + 2025.aimecon-wip.14 + mattox-etal-2025-talking + + + <fixed-case>AI</fixed-case>-Powered Coding of Elementary Students’ Small-Group Discussions about Text + CarlaFirettoArizona State University + P. KarenMurphyThe Pennsylvania State University + LinYanArizona State University + YueTangThe Pennsylvania State University + 125-134 + We report reliability and validity evidence for an AI-powered coding of 371 small-group discussion transcripts. Evidence via comparability and ground truth checks suggested high consistency between AI-produced and human-produced codes. Research in progress is also investigating reliability and validity of a new “quality” indicator to complement the current coding. 
+ 2025.aimecon-wip.15 + firetto-etal-2025-ai + + + Evaluating the Reliability of Human–<fixed-case>AI</fixed-case> Collaborative Scoring of Written Arguments Using Rational Force Model + NorikoTakahashiM.S. in Computational Linguistics, Montclair State University + AbrahamOnuorahPhD in Teacher Education and Teacher Development, Montclair State University + AlinaReznitskayaMontclair State University + EvgenyChukharevIowa State University + ArielSykesMontclair State University + MicheleFlammiaIndependent researcher + JoeOylerMaynooth University + 135-140 + This study aims to improve the reliability of a new AI collaborative scoring system used to assess the quality of students’ written arguments. The system draws on the Rational Force Model and focuses on classifying the functional relation of each proposition in terms of support, opposition, acceptability, and relevance. + 2025.aimecon-wip.16 + takahashi-etal-2025-evaluating + + + Evaluating Deep Learning and Transformer Models on <fixed-case>SME</fixed-case> and <fixed-case>G</fixed-case>en<fixed-case>AI</fixed-case> Items + JoeBettsNational Council of State Boards of Nursing + WilliamMunteanNational Council of State Boards of Nursing + 141-146 + This study leverages deep learning, transformer models, and generative AI to streamline test development by automating metadata tagging and item generation. Transformer models outperform simpler approaches, reducing SME workload. Ongoing research refines complex models and evaluates LLM-generated items, enhancing efficiency in test creation. + 2025.aimecon-wip.17 + betts-muntean-2025-evaluating + + + Comparison of <fixed-case>AI</fixed-case> and Human Scoring on A Visual Arts Assessment + NingJiangMeasurement Incorporated + YueHuangMeasurement Incorporated + JieChenMeasurement Incorporated + 147-154 + This study examines reliability and comparability of Generative AI scores versus human ratings on two performance tasks—text-based and drawing-based—in a fourth-grade visual arts assessment. Results show GPT-4 is consistent, aligned with humans but more lenient, and its agreement with humans is slightly lower than that between human raters. + 2025.aimecon-wip.18 + jiang-etal-2025-comparison + + + Explainable Writing Scores via Fine-grained, <fixed-case>LLM</fixed-case>-Generated Features + James VBrunoPearson + LeeBeckerPearson + 155-165 + Advancements in deep learning have enhanced Automated Essay Scoring (AES) accuracy but reduced interpretability. This paper investigates using LLM-generated features to train an explainable scoring model. By framing feature engineering as prompt engineering, state-of-the-art language technology can be integrated into simpler, more interpretable AES models. + 2025.aimecon-wip.19 + bruno-becker-2025-explainable + + + Validating Generative <fixed-case>AI</fixed-case> Scoring of Constructed Responses with Cognitive Diagnosis + HyunjooKimUniversity of Illinois Urbana-Champaign + 166-177 + This research explores the feasibility of applying the cognitive diagnosis assessment (CDA) framework to validate generative AI-based scoring of constructed responses (CRs). The classification information of CRs and item-parameter estimates from cognitive diagnosis models (CDMs) could provide additional validity evidence for AI-generated CR scores and feedback. 
+ 2025.aimecon-wip.20 + kim-2025-validating + + + Automated Diagnosis of Students’ Number Line Strategies for Fractions + ZhizhiWangRutgers University + DakeZhangRutgers University + MinLiUniversity of Washington + YuhanTaoColumbia University + 178-184 + This study aims to develop and evaluate an AI-based platform that automatically grades and classifies problem-solving strategies and error types in students’ handwritten fraction representations involving number lines. The model development procedures and preliminary evaluation results comparing the model with available LLMs and human expert annotations are reported. + 2025.aimecon-wip.21 + wang-etal-2025-automated + + + Medical Item Difficulty Prediction Using Machine Learning + Hope OluwaseunAdegokeUniversity of North Carolina, Greensboro + YingDuAmerican Board of Pediatrics + AndrewDwyerAmerican Board of Pediatrics + 185-190 + This project aims to use machine learning models to predict medical exam item difficulty by combining item metadata, linguistic features, word embeddings, and semantic similarity measures with a sample size of 1000 items. The goal is to improve the accuracy of difficulty prediction in medical assessment. + 2025.aimecon-wip.22 + adegoke-etal-2025-medical + + + Examining decoding items using engine transcriptions and scoring in early literacy assessment + ZacharySchultzCambium Learning Group, Inc. + MackenzieYoung + DebbieDugdaleCambium Assessment, Inc. + SusanLottridgeCambium Assessment + 191-196 + We investigate the reliability of two scoring approaches to early literacy decoding items, whereby students are shown a word and asked to say it aloud. Approaches were rubric scoring of speech and human or AI transcription with varying explicit scoring rules. Initial results suggest rubric-based approaches perform better than transcription-based methods. + 2025.aimecon-wip.23 + schultz-etal-2025-examining + + + Addressing Few-Shot <fixed-case>LLM</fixed-case> Classification Instability Through Explanation-Augmented Distillation + WilliamMunteanNational Council of State Boards of Nursing + JoeBettsNational Council of State Boards of Nursing + 197-203 + This study compares explanation-augmented knowledge distillation with few-shot in-context learning for LLM-based exam question classification. Fine-tuned smaller language models achieved competitive performance with greater consistency than large-model few-shot approaches, which exhibited notable variability across different examples. Hyperparameter selection proved essential, with extremely low learning rates significantly impairing model performance. + 2025.aimecon-wip.24 + muntean-betts-2025-addressing + + + Identifying Biases in Large Language Model Assessment of Linguistically Diverse Texts + Lionel HsienMengUniversity of Wisconsin - Madison + ShamyaKarumbaiahUniversity of Wisconsin - Madison + VivekSaravananUniversity of Wisconsin - Madison + DanielBoltUniversity of Wisconsin - Madison + 204-210 + The development of Large Language Models (LLMs) to assess student text responses is rapidly progressing, but evaluating whether LLMs equitably assess multilingual learner responses is an important precursor to adoption. Our study provides an example procedure for identifying and quantifying bias in LLM assessment of student essay responses.
+ 2025.aimecon-wip.25 + meng-etal-2025-identifying + + + Implicit Biases in Large Vision–Language Models in Classroom Contexts + PeterBaldwinNational Board of Medical Examiners + 211-217 + Using a counterfactual, adversarial, audit-style approach, we tested whether ChatGPT-4o evaluates classroom lectures differently based on teacher demographics. The model was told only to rate lecture excerpts embedded within classroom images—without reference to the images themselves. Despite this, ratings varied systematically by teacher race and sex, revealing implicit bias. + 2025.aimecon-wip.26 + baldwin-2025-implicit + + + Enhancing Item Difficulty Prediction in Large-scale Assessment with Large Language Model + MubarakMojoyinola + Olasunkanmi JamesKehindeNorfolk State University + JudyTangWestat + 218-222 + Field testing is a resource-intensive bottleneck in test development. This study applied an interpretable framework that leverages a Large Language Model (LLM) for structured feature extraction from TIMSS items. These features will train several classifiers, whose predictions will be explained using SHAP, providing actionable, diagnostic insights for item writers. + 2025.aimecon-wip.27 + mojoyinola-etal-2025-enhancing + + + Leveraging <fixed-case>LLM</fixed-case>s for Cognitive Skill Mapping in <fixed-case>TIMSS</fixed-case> Mathematics Assessment + Ruchi JSachdevaPearson + Jung YeonParkGeorge Mason University + 223-228 + This study evaluates ChatGPT-4’s potential to support validation of Q-matrices and analysis of complex skill–item interactions. By comparing its outputs to expert benchmarks, we assess accuracy, consistency, and limitations, offering insights into how large language models can augment expert judgment in diagnostic assessment and cognitive skill mapping. + 2025.aimecon-wip.28 + sachdeva-park-2025-leveraging +
+ + + Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers + JoshuaWilson + ChristopherOrmerod + MagdalenBeiting Parrish + National Council on Measurement in Education (NCME) +
Wyndham Grand Pittsburgh Downtown, Pittsburgh, Pennsylvania, United States
+ October + 2025 + 2025.aimecon-sessions + aimecon + 979-8-218-84230-7 + + + 2025.aimecon-sessions.0 + aime-con-2025-sessions + + + When Does Active Learning Actually Help? Empirical Insights with Transformer-based Automated Scoring + Justin OBarberPearson + Michael P.HemenwayPearson + EdwardWolfePearson + 1-8 + Developing automated essay scoring (AES) systems typically demands extensive human annotation, incurring significant costs and requiring considerable time. Active learning (AL) methods aim to alleviate this challenge by strategically selecting the most informative essays for scoring, thereby potentially reducing annotation requirements without compromising model accuracy. This study systematically evaluates four prominent AL strategies—uncertainty sampling, BatchBALD, BADGE, and a novel GenAI-based uncertainty approach—against a random sampling baseline, using DeBERTa-based regression models across multiple assessment prompts exhibiting varying degrees of human scorer agreement. Contrary to initial expectations, we found that AL methods provided modest but meaningful improvements only for prompts characterized by poor scorer reliability (<60% agreement per score point). Notably, extensive hyperparameter optimization alone substantially reduced the annotation budget required to achieve near-optimal scoring performance, even with random sampling. Our findings underscore that while targeted AL methods can be beneficial in contexts of low scorer reliability, rigorous hyperparameter tuning remains a foundational and highly effective strategy for minimizing annotation costs in AES system development. + 2025.aimecon-sessions.1 + barber-etal-2025-active + + + Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems + ChristopherOrmerodCambium Assessment + 9-18 + This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations – a generative language model used for spell correction and an encoder-based token-classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers. + 2025.aimecon-sessions.2 + ormerod-2025-automated + + + Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests + YanbinFuUniversity of Maryland, College Park + HongJiaoUniversity of Maryland + TianyiZhouUniversity of Maryland + NanZhangUniversity of Maryland + MingLiUniversity of Maryland + QingshuXuUniversity of Maryland, College Park + SydneyPetersUniversity of Maryland, College Park + Robert WLissitzUniversity of Maryland, College Park + 19-36 + Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts, but this judgmental process can be subjective and time-consuming.
This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for both domain and skill alignment. The model performance was evaluated using precision, recall, accuracy, weighted F1 score, and Cohen’s kappa on two test sets. The impact of input data types and training sample sizes was also explored. Results showed that including more textual inputs led to better performance gains than increasing sample size. For comparison, classic supervised machine learning classifiers were trained on multilingual-E5 embedding. Fine-tuned SLMs consistently outperformed these models, particularly for fine-grained skill alignment. To better understand model classifications, semantic similarity analyses including cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimension projections of item embedding revealed that certain skills in the two test datasets were semantically too close, providing evidence for the observed misclassification patterns. + 2025.aimecon-sessions.3 + fu-etal-2025-text + + + Review of Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments + SydneyPetersUniversity of Maryland, College Park + NanZhangUniversity of Maryland + HongJiaoUniversity of Maryland + MingLiUniversity of Maryland + TianyiZhouUniversity of Maryland + 37-47 + Item difficulty plays a crucial role in evaluating item quality, test form assembly, and interpretation of scores in large-scale assessments. Traditional approaches to estimate item difficulty rely on item response data collected in field testing, which can be time-consuming and costly. To overcome these challenges, text-based approaches leveraging machine learning and natural language processing have emerged as promising alternatives. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessments. Each study is synthesized in terms of the dataset, difficulty parameter, subject domain, item type, number of items, training and test data split, input, features, model, evaluation criteria, and model performance outcomes. Overall, text-based models achieved moderate to high predictive performance, highlighting the potential of text-based item difficulty modeling to enhance the current practices of item quality evaluation. + 2025.aimecon-sessions.4 + peters-etal-2025-review + + + Item Difficulty Modeling Using Fine-Tuned Small and Large Language Models + MingLiUniversity of Maryland + HongJiaoUniversity of Maryland + TianyiZhouUniversity of Maryland + NanZhangUniversity of Maryland + SydneyPetersUniversity of Maryland, College Park + Robert WLissitzUniversity of Maryland, College Park + 48-55 + This study investigates methods for item difficulty modeling in large-scale assessments using both small and large language models. We introduce novel data augmentation strategies, including on-the-fly augmentation and distribution balancing, that surpass benchmark performances, demonstrating their effectiveness in mitigating data imbalance and improving model performance. Our results showed that fine-tuned small language models such as BERT and RoBERTa yielded lower root mean squared error than the first-place winning model in the BEA 2024 Shared Task competition, whereas domain-specific models like BioClinicalBERT and PubMedBERT did not provide significant improvements due to distributional gaps. 
Majority voting among small language models enhanced prediction accuracy, reinforcing the benefits of ensemble learning. Large language models (LLMs), such as GPT-4, exhibited strong generalization capabilities but struggled with item difficulty prediction, likely due to limited training data and the absence of explicit difficulty-related context. Chain-of-thought prompting and rationale generation approaches were explored but did not yield substantial improvements, suggesting that additional training data or more sophisticated reasoning techniques may be necessary. Embedding-based methods, particularly using NV-Embed-v2, showed promise but did not outperform our best augmentation strategies, indicating that capturing nuanced difficulty-related features remains a challenge. + 2025.aimecon-sessions.5 + li-etal-2025-item + + + Operational Alignment of Confidence-Based Flagging Methods in Automated Scoring + CoreyPalermoMeasurement Incorporated + TroyChenMeasurement Incorporated + AriantoWibowoMeasurement Incorporated + 56-60 + 2025.aimecon-sessions.6 + palermo-etal-2025-operational + + + Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data + TylerBurleighKhan Academy + JingChenKhan Academy + KristenDicerboKhan Academy + 61-68 + Correct answers to math problems don’t reveal if students understand concepts or just memorized procedures. Conversation-Based Assessment (CBA) addresses this through AI dialogue, but reliable scoring requires costly pilots and specialized expertise. Our Criteria Development Platform (CDP) enables pre-pilot optimization using synthetic data, reducing development from months to days. Testing 17 math items through 68 iterations, all achieved our reliability threshold (MCC ≥ 0.80) after refinement – up from 59% initially. Without refinement, 7 items would have remained below this threshold. By making reliability validation accessible, CDP empowers educators to develop assessments meeting automated scoring standards. + 2025.aimecon-sessions.7 + burleigh-etal-2025-pre + + + When Humans Can’t Agree, Neither Can Machines: The Promise and Pitfalls of <fixed-case>LLM</fixed-case>s for Formative Literacy Assessment + OwenHenkelUniversity of Oxford + KirkVanacoreCornell University + BillRobertsLegible Labs + 69-78 + Story retell assessments provide valuable insights into reading comprehension but face implementation barriers due to time-intensive administration and scoring.
This study examines whether Large Language Models (LLMs) can reliably replicate human judgment in grading story retells. Using a novel dataset, we conduct three complementary studies examining LLM performance across different rubric systems, agreement patterns, and reasoning alignment. We find that LLMs (a) achieve near-human reliability with appropriate rubric design, (b) perform well on easy-to-grade cases but poorly on ambiguous ones, (c) produce explanations for their grades that are plausible for straightforward cases but unreliable for complex ones, and (d) different LLMs display consistent “grading personalities” (systematically scoring harder or easier across all student responses). These findings support hybrid assessment architectures where AI handles routine scoring, enabling more frequent formative assessment while directing teacher expertise toward students requiring nuanced support. + 2025.aimecon-sessions.8 + henkel-etal-2025-humans + + + Beyond the Hint: Using Self-Critique to Constrain <fixed-case>LLM</fixed-case> Feedback in Conversation-Based Assessment + TylerBurleighKhan Academy + JennyHanKhan Academy + KristenDicerboKhan Academy + 79-85 + Large Language Models in Conversation-Based Assessment tend to provide inappropriate hints that compromise validity. We demonstrate that self-critique – a simple prompt engineering technique – effectively constrains this behavior. Through two studies using synthetic conversations and real-world high school math pilot data, self-critique reduced inappropriate hints by 90.7% and 24-75% respectively. Human experts validated ground truth labels while LLM judges enabled scale. This immediately deployable solution addresses the critical tension in intermediate-stakes assessment: maintaining student engagement while ensuring fair comparisons. Our findings show prompt engineering can meaningfully safeguard assessment integrity without model fine-tuning. + 2025.aimecon-sessions.9 + burleigh-etal-2025-beyond + + + Investigating Adversarial Robustness in <fixed-case>LLM</fixed-case>-based <fixed-case>AES</fixed-case> + RenjithRavindranETS + IkkyuChoiETS + 86-91 + Automated Essay Scoring (AES) is one of the most widely studied applications of Natural Language Processing (NLP) in education and educational measurement. Recent advances with pre-trained Transformer-based large language models (LLMs) have shifted AES from feature-based modeling to leveraging contextualized language representations. These models provide rich semantic representations that substantially improve scoring accuracy and human–machine agreement compared to systems relying on handcrafted features. However, their robustness towards adversarially crafted inputs remains poorly understood. In this study, we define adversarial input as any modification of the essay text designed to fool an automated scoring system into assigning an inflated score. We evaluate a fine-tuned DeBERTa-based AES model on such inputs and show that it is highly susceptible to a simple text duplication attack, highlighting the need to consider adversarial robustness alongside accuracy in the development of AES systems.
+ 2025.aimecon-sessions.10 + ravindran-choi-2025-investigating + + + Effects of Generation Model on Detecting <fixed-case>AI</fixed-case>-generated Essays in a Writing Test + JiyunZuETS + MichaelFaussETS + ChenLiETS + 92-98 + Various detectors have been developed to detect AI-generated essays using labeled datasets of human-written and AI-generated essays, with many reporting high detection accuracy. In real-world settings, essays may be generated by models different from those used to train the detectors. This study examined the effects of generation model on detector performance. We focused on two generation models – GPT-3.5 and GPT-4 – and used writing items from a standardized English proficiency test. Eight detectors were built and evaluated. Six were trained on three training sets (human-written essays combined with either GPT-3.5-generated essays, or GPT-4-generated essays, or both) using two training approaches (feature-based machine learning and fine-tuning RoBERTa), and the remaining two were ensemble detectors. Results showed that a) fine-tuned detectors outperformed feature-based machine learning detectors on all studied metrics; b) detectors trained with essays generated from only one model were more likely to misclassify essays generated by the other model as human-written essays (false negatives), but did not misclassify more human-written essays as AI-generated (false positives); c) the ensemble fine-tuned RoBERTa detector had fewer false positives, but slightly more false negatives than detectors trained with essays generated by both models. + 2025.aimecon-sessions.11 + zu-etal-2025-effects + + + Exploring the Interpretability of <fixed-case>AI</fixed-case>-Generated Response Detection with Probing + IkkyuChoiETS + JiyunZuETS + 99-106 + Multiple strategies for AI-generated response detection have been proposed, with many high-performing ones built on language models. However, the decision-making processes of these detectors remain largely opaque. We addressed this knowledge gap by fine-tuning a language model for the detection task and applying probing techniques using adversarial examples. Our adversarial probing analysis revealed that the fine-tuned model relied heavily on a narrow set of lexical cues in making the classification decision. These findings underscore the importance of interpretability in AI-generated response detectors and highlight the value of adversarial probing as a tool for exploring model interpretability. + 2025.aimecon-sessions.12 + choi-zu-2025-exploring + + + A Fairness-Promoting Detection Objective With Applications in <fixed-case>AI</fixed-case>-Assisted Test Security + MichaelFaussETS + IkkyuChoiETS + 107-114 + A detection objective based on bounded group-wise false alarm rates is proposed to promote fairness in the context of test fraud detection. The paper begins by outlining key aspects and characteristics that distinguish fairness in test security from fairness in other domains and machine learning in general. The proposed detection objective is then introduced, the corresponding optimal detection policy is derived, and the implications of the results are examined in light of the earlier discussion. A numerical example using synthetic data illustrates the proposed detector and compares its properties to those of a standard likelihood ratio test. 
+ 2025.aimecon-sessions.13 + fauss-choi-2025-fairness + + + The Impact of an <fixed-case>NLP</fixed-case>-Based Writing Tool on Student Writing + KarthikSairamCambium Assessment + AmyBurkhardtCambium Assessment + SusanLottridgeCambium Assessment + 115-123 + We present preliminary evidence on the impact of an NLP-based writing feedback tool, Write-On with Cambi!, on students’ argumentative writing. Students were randomly assigned to receive access to the tool or not, and their essay scores were compared across three rubric dimensions; estimated effect sizes (Cohen’s d) ranged from 0.25 to 0.26 (with notable variation in the average treatment effect across classrooms). To characterize and compare the groups’ writing processes, we implemented an algorithm that classified each revision as Appended (new text added to the end), Surface-level (minor within-text corrections to conventions), or Substantive (larger within-text changes or additions). We interpret within-text edits (Surface-level or Substantive) as potential markers of metacognitive engagement in revision, and note that these within-text edits are more common in students who had access to the tool. Together, these pilot analyses serve as a first step in testing the tool’s theory of action. + 2025.aimecon-sessions.14 + sairam-etal-2025-impact + +
+
diff --git a/data/yaml/venues/aimecon.yaml b/data/yaml/venues/aimecon.yaml new file mode 100644 index 0000000000..7a942ed59f --- /dev/null +++ b/data/yaml/venues/aimecon.yaml @@ -0,0 +1,2 @@ +acronym: AIME-Con +name: Artificial Intelligence in Measurement and Education Conference (AIME-Con)