SiTSE: Sinhala Text Simplification Dataset and Evaluation

dc.citation.issue: 5
dc.citation.volume: 24
dc.contributor.author: Ranathunga S
dc.contributor.author: Sirithunga R
dc.contributor.author: Rathnayake H
dc.contributor.author: De Silva L
dc.contributor.author: Aluthwala T
dc.contributor.author: Peramuna S
dc.contributor.author: Shekhar R
dc.contributor.editor: Zitouni I
dc.date.accessioned: 2025-06-05T03:03:43Z
dc.date.available: 2025-06-05T03:03:43Z
dc.date.issued: 2025-05-08
dc.description.abstract: Text Simplification is a task that has been minimally explored for low-resource languages. Consequently, there are only a few manually curated datasets. In this article, we present a human-curated sentence-level text simplification dataset for the Sinhala language. Our evaluation dataset contains 1,000 complex sentences and 3,000 corresponding simplified sentences produced by three different human annotators. We model the text simplification task as a zero-shot and zero-resource sequence-to-sequence (seq-seq) task on the multilingual language models mT5 and mBART. We exploit auxiliary data from related seq-seq tasks and explore the possibility of using intermediate task transfer learning (ITTL). Our analysis shows that ITTL outperforms the previously proposed zero-resource methods for text simplification. Our findings also highlight the challenges in evaluating text simplification systems and support the calls for improved metrics for measuring the quality of automated text simplification systems that would suit low-resource languages as well. Our code and data are publicly available: https://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-andEvaluation.
dc.description.confidential: false
dc.format.pagination: 1-19
dc.identifier.citation: Ranathunga S, Sirithunga R, Rathnayake H, de Silva L, Aluthwala T, Peramuna S, Shekhar R. (2025). SiTSE: Sinhala Text Simplification Dataset and Evaluation. ACM Transactions on Asian and Low-Resource Language Information Processing. 24. 5. (pp. 1-19).
dc.identifier.doi: 10.1145/3723160
dc.identifier.eissn: 2375-4702
dc.identifier.elements-type: journal-article
dc.identifier.issn: 2375-4699
dc.identifier.number: 51
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/73001
dc.language: English
dc.publisher: Association for Computing Machinery
dc.publisher.uri: http://dl.acm.org/doi/10.1145/3723160
dc.relation.isPartOf: ACM Transactions on Asian and Low-Resource Language Information Processing
dc.rights: (c) 2025 The Author/s
dc.rights: CC BY 4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Computing methodologies → Artificial intelligence
dc.subject: Natural language processing
dc.subject: Language resources
dc.title: SiTSE: Sinhala Text Simplification Dataset and Evaluation
dc.type: Journal article
pubs.elements-id: 500941
pubs.organisational-group: Other
Files
Original bundle
Name: 500941 PDF.pdf
Size: 2.64 MB
Format: Adobe Portable Document Format
Description: Published version.pdf
License bundle
Name: license.txt
Size: 9.22 KB
Format: Plain Text