Linguistic entity masking to improve cross-lingual representation of multilingual language models for low-resource languages

dc.citation.volume: Latest Articles
dc.contributor.author: Fernando A
dc.contributor.author: Ranathunga S
dc.date.accessioned: 2025-07-29T21:10:32Z
dc.date.available: 2025-07-29T21:10:32Z
dc.date.issued: 2025-07-19
dc.description.abstract: Multilingual pre-trained language models (multiPLMs), trained with the Masked Language Modelling (MLM) objective, are commonly used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to further pre-train it; this is known as continual pre-training. Previous research has shown that continual pre-training with MLM, and subsequently with Translation Language Modelling (TLM), improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of their linguistic properties. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM), to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. In contrast to MLM and TLM, LEM limits masking to the linguistic entity types nouns, verbs and named entities, which hold higher prominence in a sentence. Furthermore, masking is limited to a single token within the linguistic entity span, thus retaining more context, whereas MLM and TLM mask tokens randomly. We evaluate the effectiveness of LEM on three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis, using three low-resource language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil. Experimental results show that a multiPLM continually pre-trained with LEM outperforms a multiPLM continually pre-trained with MLM+TLM on all three tasks.
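The abstract describes LEM only at a high level; the following is a minimal Python sketch of that masking idea, included for illustration. It is not the authors' released code: the function name lem_mask, the tag set, the precomputed entity spans, the mask symbol and the 15% masking budget are all assumptions made for the example.

import random

MASK = "<mask>"
PROMINENT = {"NOUN", "VERB", "NE"}   # nouns, verbs, named entities

def lem_mask(tokens, tags, spans, mask_ratio=0.15, seed=0):
    """tokens: token list; tags: coarse POS/NER label per token;
    spans: (start, end) index pairs marking linguistic-entity spans."""
    rng = random.Random(seed)
    masked = list(tokens)
    # Keep only spans whose label is one of the prominent entity types.
    candidates = [(s, e) for (s, e) in spans if tags[s] in PROMINENT]
    rng.shuffle(candidates)
    budget = max(1, int(mask_ratio * len(tokens)))
    used = 0
    for start, end in candidates:
        if used >= budget:
            break
        # Mask a single token inside the span, keeping the rest as context.
        idx = rng.randrange(start, end)
        masked[idx] = MASK
        used += 1
    return masked

# Toy example with hand-assigned tags and entity spans.
tokens = ["The", "committee", "approved", "the", "proposal", "in", "Colombo"]
tags   = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADP", "NE"]
spans  = [(1, 2), (2, 3), (4, 5), (6, 7)]
print(lem_mask(tokens, tags, spans))

By contrast, an MLM- or TLM-style baseline would sample the masked positions uniformly over all tokens rather than restricting them to prominent linguistic-entity spans.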
dc.description.confidential: false
dc.identifier.citation: Fernando A, Ranathunga S. (2025). Linguistic entity masking to improve cross-lingual representation of multilingual language models for low-resource languages. Knowledge and Information Systems. Latest Articles.
dc.identifier.doi: 10.1007/s10115-025-02520-4
dc.identifier.eissn: 0219-3116
dc.identifier.elements-type: journal-article
dc.identifier.issn: 0219-1377
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/73253
dc.language: English
dc.publisher: Springer-Verlag London Ltd
dc.publisher.uri: https://link.springer.com/article/10.1007/s10115-025-02520-4
dc.relation.isPartOf: Knowledge and Information Systems
dc.rights: (c) The author/s
dc.rights.license: CC BY
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Masked language modelling
dc.subject: Translation language modelling
dc.subject: Multilingual pre-trained language model
dc.subject: Bitext mining
dc.subject: Sentiment analysis
dc.subject: XLM-R
dc.subject: Sinhala
dc.subject: Tamil
dc.title: Linguistic entity masking to improve cross-lingual representation of multilingual language models for low-resource languages
dc.type: Journal article
pubs.elements-id: 501733
pubs.organisational-group: Other

Files

Original bundle

Name: 501733 PDF.pdf
Size: 1.73 MB
Format: Adobe Portable Document Format
