Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

dc.citation.issue: 6
dc.citation.volume: 9
dc.contributor.author: Wardle G
dc.contributor.author: Susnjak T
dc.date.accessioned: 2026-01-19T23:40:18Z
dc.date.issued: 2025-06-03
dc.description.abstract: Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors—OpenAI, Google, and Anthropic—alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design. Our findings reveal that modality sequencing is a critical factor influencing reasoning performance, particularly in tasks with varying cognitive load and structural complexity. For simpler tasks involving a single image, positioning the modalities directly impacts model accuracy, whereas in complex, multi-step reasoning scenarios, the sequence must align with the logical structure of inference, often outweighing the specific placement of individual modalities. Furthermore, we identify systematic challenges in multi-hop reasoning within transformer-based architectures, where models demonstrate strong early-stage inference but struggle with integrating prior contextual information in later reasoning steps. Building on these insights, we propose a set of validated, user-centred heuristics for designing effective multi-modal prompts, enhancing both reasoning accuracy and user interaction with AI systems. Our contributions inform the design and usability of interactive intelligent systems, with implications for applications in education, medical imaging, legal document analysis, and customer support. By bridging the gap between intelligent system behaviour and user interaction strategies, this study provides actionable guidance on how users can effectively structure prompts to optimise multi-modal LLM reasoning within real-world, high-stakes decision-making contexts.
dc.description.confidential: false
dc.description.notes: archiveprefix: arXiv; primaryclass: cs.AI
dc.edition.edition: June 2025
dc.identifier.citation: Wardle G, Sušnjak T. (2025). Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks. Big Data and Cognitive Computing, 9(6), 149.
dc.identifier.doi: 10.3390/bdcc9060149
dc.identifier.eissn: 2504-2289
dc.identifier.elements-type: journal-article
dc.identifier.number: 149
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/74043
dc.language: English
dc.publisher: MDPI (Basel, Switzerland)
dc.publisher.uri: https://www.mdpi.com/2504-2289/9/6/149
dc.relation.isPartOf: Big Data and Cognitive Computing
dc.rights: CC BY 4.0
dc.rights: (c) 2025 The Author/s
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: chain-of-thought reasoning
dc.subject: human–AI interaction
dc.subject: interactive AI systems
dc.subject: modality fusion
dc.subject: multi-modal large language models
dc.subject: multi-modal prompting
dc.subject: multi-modal reasoning
dc.subject: user-centred prompt engineering
dc.subject: user-guided AI adaptation
dc.title: Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
dc.type: Journal article
pubs.elements-id: 500237
pubs.organisational-group: Other
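The record above concerns how the ordering of image and text parts within a single multi-modal prompt affects LLM reasoning. As a minimal sketch of the two conditions the title names, assuming the common OpenAI-style chat message shape (where multi-modal content is a list of typed parts) and using a hypothetical question and image URL as placeholders:

```python
# Sketch of the two prompt orderings compared in the article, using the
# OpenAI-style chat message shape in which multi-modal content is a list
# of typed parts. QUESTION and IMAGE_URL are illustrative placeholders,
# not taken from the study.
QUESTION = "How many load-bearing columns are visible in the floor plan?"
IMAGE_URL = "https://example.com/floor-plan.png"

text_part = {"type": "text", "text": QUESTION}
image_part = {"type": "image_url", "image_url": {"url": IMAGE_URL}}

def build_prompt(image_first: bool) -> list[dict]:
    """Return a single-turn message list with the chosen modality order."""
    parts = [image_part, text_part] if image_first else [text_part, image_part]
    return [{"role": "user", "content": parts}]

image_first_prompt = build_prompt(image_first=True)
text_first_prompt = build_prompt(image_first=False)

# The two variants carry identical content; only the sequencing differs,
# which is the manipulation whose effect on accuracy the study measures.
print([p["type"] for p in image_first_prompt[0]["content"]])
print([p["type"] for p in text_first_prompt[0]["content"]])
```

Keeping the content of the two variants byte-for-byte identical apart from part order is what makes such a comparison a controlled test of sequencing rather than of wording.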

Files

Original bundle
- Name: 500237 PDF.pdf; Size: 19.59 MB; Format: Adobe Portable Document Format; Description: Evidence

License bundle
- Name: license.txt; Size: 9.22 KB; Format: Plain Text