A multi-way parallel named entity annotated corpus for English, Tamil and Sinhala
| Field | Value | |
| --- | --- | --- |
| dc.citation.volume | 11 | |
| dc.contributor.author | Ranathunga S | |
| dc.contributor.author | Ranasinghe A | |
| dc.contributor.author | Shamal J | |
| dc.contributor.author | Dandeniya A | |
| dc.contributor.author | Galappaththi R | |
| dc.contributor.author | Samaraweera M | |
| dc.date.accessioned | 2025-12-03T01:44:20Z | |
| dc.date.issued | 2025-06 | |
| dc.description.abstract | This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation of the NER capabilities of different types of LMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly released: https://github.com/suralk/multiNER. | |
| dc.description.confidential | false | |
| dc.edition.edition | June 2025 | |
| dc.identifier.citation | Ranathunga S, Ranasinghe A, Shamal J, Dandeniya A, Galappaththi R, Samaraweera M. (2025). A multi-way parallel named entity annotated corpus for English, Tamil and Sinhala. Natural Language Processing Journal. 11. | |
| dc.identifier.doi | 10.1016/j.nlp.2025.100160 | |
| dc.identifier.eissn | 2949-7191 | |
| dc.identifier.elements-type | journal-article | |
| dc.identifier.issn | 2949-7191 | |
| dc.identifier.number | 100160 | |
| dc.identifier.pii | S2949719125000366 | |
| dc.identifier.uri | https://mro.massey.ac.nz/handle/10179/73893 | |
| dc.language | English | |
| dc.publisher | Elsevier B.V. | |
| dc.publisher.uri | https://www.sciencedirect.com/science/article/pii/S2949719125000366 | |
| dc.relation.isPartOf | Natural Language Processing Journal | |
| dc.rights | CC BY 4.0 | |
| dc.rights | (c) 2025 The Author/s | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Named entity recognition | |
| dc.subject | Pre-trained language models | |
| dc.subject | Low resource languages | |
| dc.subject | Sinhala | |
| dc.subject | Tamil | |
| dc.subject | Large language models (LLMs) | |
| dc.title | A multi-way parallel named entity annotated corpus for English, Tamil and Sinhala | |
| dc.type | Journal article | |
| pubs.elements-id | 608340 | |
| pubs.organisational-group | Other | |

