SiTSE: Sinhala Text Simplification Dataset and Evaluation

Text simplification has been minimally explored for low-resource languages, and consequently only a few manually curated datasets exist. In this article, we present a human-curated, sentence-level text simplification dataset for the Sinhala language. Our evaluation dataset contains 1,000 complex sentences and 3,000 corresponding simplified sentences produced by three different human annotators. We model text simplification as a zero-shot and zero-resource sequence-to-sequence (seq-seq) task on the multilingual language models mT5 and mBART. We exploit auxiliary data from related seq-seq tasks and explore the possibility of using intermediate task transfer learning (ITTL). Our analysis shows that ITTL outperforms the previously proposed zero-resource methods for text simplification. Our findings also highlight the challenges in evaluating text simplification systems and support the calls for improved metrics for measuring the quality of automated text simplification systems that would suit low-resource languages as well. Our code and data are publicly available: https://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-andEvaluation.

Keywords

Computing methodologies → Artificial intelligence, Natural language processing, Language resources

Citation

Ranathunga S, Sirithunga R, Rathnayake H, de Silva L, Aluthwala T, Peramuna S, Shekhar R. (2025). SiTSE: Sinhala Text Simplification Dataset and Evaluation. ACM Transactions on Asian and Low-Resource Language Information Processing, 24(5), pp. 1-19.

URI

https://mro.massey.ac.nz/handle/10179/73001

Collections

Journal Articles
