A multi-way parallel named entity annotated corpus for English, Tamil and Sinhala

Date

2025-06

Publisher

Elsevier B.V.

Rights

CC BY 4.0
(c) 2025 The Author/s

Abstract

This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation of the NER capabilities of different types of LMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly released: https://github.com/suralk/multiNER.

Keywords

Named entity recognition, Pre-trained language models, Low resource languages, Sinhala, Tamil, Large language models (LLMs)

Citation

Ranathunga S, Ranasinghe A, Shamal J, Dandeniya A, Galappaththi R, Samaraweera M. (2025). A multi-way parallel named entity annotated corpus for English, Tamil and Sinhala. Natural Language Processing Journal. 11.

Creative Commons license

Except where otherwise noted, this item's license is described as CC BY 4.0