Automatically identifying errors in primary level math word problems generated by large language models : a research report submitted to School of Mathematical and Computational Sciences in partial fulfillment of the requirements for the degree of Master of Information Sciences, School of Mathematical and Computational Sciences, Massey University


Date

2025

Publisher

Massey University

Rights

(c) The author

Abstract

Ensuring the quality of mathematical word problems (MWPs) is essential for primary education. However, large language models (LLMs) struggle with error identification despite excelling in problem-solving. This research evaluates four LLMs – Mixtral-8x7B-Instruct-v0.1 (Mixtral-8x7B), Meta-Llama-3.1-8B-Instruct (Llama-3.1-8B), DeepSeek-Math-7B-Instruct (DeepSeek-Math-7B), and Llama-3.2-3B-Instruct (Llama-3.2-3B) – for detecting errors in an LLM-generated dataset of 5,098 MWPs spanning U.S. grades 1–6. A comprehensive framework with 12 error categories is introduced, going beyond most categorization schemes used in prior research. Evaluating Zero-Shot (inference without any examples), One-Shot (inference with one example), and Three-Shot (inference with three examples) approaches, as well as fine-tuning, across the four models in seven experiments, we found that the small-scale model Llama-3.2-3B achieved the best Zero-Shot accuracy of 90% using only 6 GB of GPU memory, comparable to the larger Mixtral-8x7B's fine-tuned accuracy of 90.62%. However, due to data noise and prompt complexity, fine-tuning degraded performance overall, with an average accuracy of 78.48%; prompt complexity alone reduced the Mixtral-8x7B model's accuracy by up to 20%. Safety biases, particularly in Llama-3.1-8B and Mixtral-8x7B, led to misclassifications when prompts contained safety-trigger words. Our findings highlight the efficacy of small-scale LLMs and concise prompts for educational applications while identifying challenges in fine-tuning and model bias. We propose future research directions that include noise-robust data preprocessing, refined prompt engineering, and adversarial fine-tuning. These approaches aim to enhance the reliability of LLMs in detecting errors in MWPs, thereby ensuring the validity of educational assessments and ultimately contributing to the advancement of high-quality foundational mathematics education.
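The Zero-Shot, One-Shot, and Three-Shot approaches described above differ only in how many worked examples are prepended to the error-identification query. A minimal sketch (the prompt wording, example problems, and category names here are illustrative assumptions, not the thesis's actual prompts or its 12-category framework):

```python
# Sketch of k-shot prompt construction for MWP error identification.
# The instruction text and example labels below are hypothetical.

def build_prompt(problem: str, examples: list[tuple[str, str]]) -> str:
    """Prepend k labelled examples (k = 0, 1, or 3) to the query problem."""
    parts = [
        "Identify the error category in the following math word problem, "
        "or answer 'no error'."
    ]
    for ex_problem, ex_label in examples:
        parts.append(f"Problem: {ex_problem}\nError: {ex_label}")
    # The problem to be checked comes last, with the label left blank
    # for the model to complete.
    parts.append(f"Problem: {problem}\nError:")
    return "\n\n".join(parts)

# Illustrative demonstration example (not from the thesis dataset).
demo = [("Tom has 3 apples and eats 5. How many apples are left?",
         "calculation error")]

zero_shot = build_prompt("Sara buys 4 pens and 2 pencils. How many items does she buy?", [])
one_shot = build_prompt("Sara buys 4 pens and 2 pencils. How many items does she buy?", demo)
```

A Three-Shot prompt would pass three `(problem, label)` pairs instead of one; fine-tuning, by contrast, updates model weights on labelled data rather than supplying examples at inference time.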

Keywords

Math Word Problems, Fine-Tuning, Large Language Models, LLMs
