Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
Search Results
Item: Automatically identifying errors in primary level math word problems generated by large language models : a research report submitted to the School of Mathematical and Computational Sciences in partial fulfillment of the requirements for the degree of Master of Information Sciences, School of Mathematical and Computational Sciences, Massey University (Massey University, 2025) Mai, Zhuonan
Ensuring the quality of mathematical word problems (MWPs) is essential for primary education. However, large language models (LLMs) struggle with error identification despite excelling in problem-solving. This research evaluates four LLMs – Mixtral-8x7B-Instruct-v0.1 (Mixtral-8x7B), Meta-Llama-3.1-8B-Instruct (Llama-3.1-8B), DeepSeek-Math-7B-Instruct (DeepSeek-Math-7B), and Llama-3.2-3B-Instruct (Llama-3.2-3B) – for detecting errors in an LLM-generated dataset of 5,098 MWPs spanning U.S. grades 1–6. A comprehensive framework with 12 error categories is introduced, going beyond most categorization schemes used in prior research. Evaluating Zero-Shot (inference without examples), One-Shot (inference with one example), and Three-Shot (inference with three examples) prompting, as well as fine-tuning, across the four models in seven experiments, we found that the small-scale Llama-3.2-3B achieved the best Zero-Shot accuracy of 90% using only 6 GB of GPU memory, comparable to the larger Mixtral-8x7B's fine-tuned accuracy of 90.62%. However, owing to data noise and prompt complexity, fine-tuning degraded performance, with an average accuracy of 78.48%. Prompt complexity reduced accuracy by up to 20% for Mixtral-8x7B. Safety biases, particularly in Llama-3.1-8B and Mixtral-8x7B, led to misclassifications when safety-related words were triggered. Our findings highlight the efficacy of small-scale LLMs and concise prompts for educational applications while identifying challenges in fine-tuning and model bias. We propose future research directions including noise-robust data preprocessing, refined prompt engineering, and adversarial fine-tuning. These approaches aim to enhance the reliability of LLMs in detecting errors in MWPs, thereby ensuring the validity of educational assessments and contributing to high-quality foundational mathematics education.
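As a rough illustration of the Zero-Shot setup described in the abstract above, the sketch below prompts an instruction-tuned model to assign exactly one error label to a word problem, with no worked examples. The category names, prompt wording, and use of the Hugging Face model identifier are illustrative assumptions, not the study's actual 12-category protocol.

```python
# Minimal sketch of Zero-Shot error classification for a math word problem (illustrative only).
# The categories below are invented placeholders, not the paper's 12 categories, and the model id
# is assumed to be available via Hugging Face transformers.
from transformers import pipeline

ERROR_CATEGORIES = [
    "no_error", "arithmetic_inconsistency", "missing_information",
    "ambiguous_question", "unrealistic_context",  # placeholder subset
]

def classify_mwp(problem_text: str, generator) -> str:
    """Ask the model for exactly one error label, with no examples in the prompt (Zero-Shot)."""
    prompt = (
        "You are checking a primary-school math word problem for quality issues.\n"
        f"Problem: {problem_text}\n"
        f"Choose exactly one label from: {', '.join(ERROR_CATEGORIES)}.\n"
        "Answer with the label only."
    )
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    # The pipeline returns the prompt plus the generation; keep only the new text.
    return out[0]["generated_text"][len(prompt):].strip()

if __name__ == "__main__":
    gen = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")
    print(classify_mwp("Tom has 3 apples and eats 5 of them. How many apples are left?", gen))
```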
Item: Large Multi-Modal Model Cartographic Map Comprehension for Textual Locality Georeferencing (Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025-08-15) Wijegunarathna K; Stock K; Jones CB; Sila-Nowicka K; Moore A; O’Sullivan D; Adams B; Gahegan M
Millions of biological sample records collected over the last few centuries and archived in natural history collections are un-georeferenced. Georeferencing the complex locality descriptions associated with these collection samples is a highly labour-intensive task that collection agencies struggle with. None of the existing automated methods exploits maps, which are an essential tool for georeferencing complex spatial relations. We present preliminary experiments and results for a novel method that exploits the multimodal capabilities of recent Large Multi-Modal Models (LMMs). The method enables the model to visually contextualize the spatial relations it reads in a locality description. We use a grid-based approach to adapt these auto-regressive models to the task in a zero-shot setting. Our experiments on a small manually annotated dataset show impressive results for our approach (∼1 km average distance error) compared with uni-modal georeferencing using Large Language Models and with existing georeferencing tools. The paper also discusses the findings in light of an LMM's ability to comprehend fine-grained maps. Motivated by these results, a practical framework is proposed to integrate the method into a georeferencing workflow.
Item: A Framework to Assess Multilingual Vulnerabilities of LLMs (Association for Computing Machinery, 2025-05-23) Tang L; Bogahawatta N; Ginige Y; Xu J; Sun S; Ranathunga S; Seneviratne S
Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. While they undergo safety training to prevent them from answering illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRLs). This paper proposes a framework to automatically assess the multilingual vulnerabilities of commonly used LLMs. Using the framework, we evaluated six LLMs across eight languages representing varying levels of resource availability. We validated the automated assessments through human evaluation in two languages, demonstrating that the framework's results align with human judgments in most cases. Our findings reveal vulnerabilities in LRLs; however, these may pose minimal risk, as they often stem from the models' poor performance in those languages, resulting in incoherent responses.
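For the cartographic georeferencing item above, a minimal sketch of the grid-based idea under assumed details: the map's known bounding box is divided into labelled cells, the multimodal model is asked which cell the locality description falls in, and the chosen cell's centre is converted back to coordinates so a distance error can be computed. The cell-labelling scheme, the bounding box, and the haversine evaluation are illustrative assumptions, not the authors' implementation; the LMM call itself is left out because the exact model and prompt are not specified here.

```python
# Illustrative sketch of one grid-based zero-shot georeferencing step (not the authors' code).
# Assumes a map image whose geographic extent is known, divided into an n x n grid of labelled cells.
import math

def cell_to_latlon(label: str, bbox, n: int = 10):
    """Convert a grid label like 'C7' (row letter, column number) to the cell-centre lat/lon."""
    min_lat, min_lon, max_lat, max_lon = bbox
    row = ord(label[0].upper()) - ord("A")
    col = int(label[1:]) - 1
    lat = max_lat - (row + 0.5) * (max_lat - min_lat) / n   # rows count down from the top edge
    lon = min_lon + (col + 0.5) * (max_lon - min_lon) / n
    return lat, lon

def distance_km(p, q):
    """Great-circle (haversine) distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Example: suppose the LMM, shown the gridded map, answers "C7" for a locality description.
bbox = (-41.5, 174.5, -40.5, 175.5)     # assumed map extent (min_lat, min_lon, max_lat, max_lon)
predicted = cell_to_latlon("C7", bbox)
ground_truth = (-41.20, 175.18)         # assumed annotated coordinate
print(f"distance error: {distance_km(predicted, ground_truth):.2f} km")
```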
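For the multilingual-vulnerability item, a minimal sketch of what an automated assessment loop of this kind could look like: a fixed set of unsafe probes is rendered in each target language, sent to each model, and the responses are bucketed as refusal, incoherent, or compliant. The language list, placeholder probes, and the crude classify_response heuristic are assumptions for illustration; the paper's actual framework, probes, and judging criteria are not reproduced here.

```python
# Illustrative sketch of a multilingual vulnerability sweep (assumed structure, not the authors' framework).
from collections import Counter

LANGUAGES = ["en", "si", "ta"]                                # placeholder language codes
UNSAFE_PROMPTS = ["<unsafe prompt 1>", "<unsafe prompt 2>"]   # placeholders; real probes not reproduced

def classify_response(text: str) -> str:
    """Very rough bucketing; a real framework would use a judge model or human evaluation."""
    if any(kw in text.lower() for kw in ("i can't", "i cannot", "i won't")):
        return "refusal"
    if len(text.split()) < 5:
        return "incoherent"
    return "compliant"

def assess(models, translate, query):
    """models: dict name -> model handle; translate(prompt, lang) and query(model, prompt) are supplied by the harness."""
    results = {}
    for name, model in models.items():
        counts = Counter()
        for lang in LANGUAGES:
            for prompt in UNSAFE_PROMPTS:
                response = query(model, translate(prompt, lang))
                counts[(lang, classify_response(response))] += 1
        results[name] = counts
    return results

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    dummy_models = {"model-a": lambda p: "I can't help with that."}
    identity_translate = lambda prompt, lang: f"[{lang}] {prompt}"
    call = lambda model, prompt: model(prompt)
    print(assess(dummy_models, identity_translate, call))
```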
