A multi-way parallel named entity annotated corpus for English, Tamil and Sinhala
Date
2025-06
Publisher
Elsevier B.V.
Rights
CC BY 4.0
(c) 2025 The Author/s
Abstract
This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation of the NER capabilities of different types of LMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly released: https://github.com/suralk/multiNER.
Keywords
Named entity recognition, Pre-trained language models, Low resource languages, Sinhala, Tamil, Large language models (LLMs)
Citation
Ranathunga S, Ranasinghe A, Shamal J, Dandeniya A, Galappaththi R, Samaraweera M. (2025). A multi-way parallel named entity annotated corpus for English, Tamil and Sinhala. Natural Language Processing Journal. 11.
Creative Commons license
Except where otherwise noted, this item's license is described as CC BY 4.0

