Received: 17 May 2023 | Revised: 21 November 2023 | Accepted: 29 November 2023
CAAI Transactions on Intelligence Technology | DOI: 10.1049/cit2.12333

ORIGINAL RESEARCH

Lexicon-based fine-tuning of multilingual language models for low-resource language sentiment analysis

Vinura Dhananjaya¹ | Surangika Ranathunga¹,² | Sanath Jayasena¹

¹Department of Computer Science and Engineering, University of Moratuwa, Moratuwa, Sri Lanka
²School of Mathematical and Computational Sciences, Massey University, Palmerston North, New Zealand

Correspondence: Surangika Ranathunga. Email: s.ranathunga@massey.ac.nz

Funding information: University of Moratuwa, Grant/Award Number: SRC long-term

Abstract
Pre-trained multilingual language models (PMLMs) such as mBERT and XLM-R have shown good cross-lingual transferability. However, they are not specifically trained to capture cross-lingual signals concerning sentiment words. This poses a disadvantage for low-resource languages (LRLs) that are under-represented in these models. To better fine-tune these models for sentiment classification in LRLs, a novel intermediate task fine-tuning (ITFT) technique based on a sentiment lexicon of a high-resource language (HRL) is introduced. The authors experiment with the LRLs Sinhala, Tamil and Bengali on a 3-class sentiment classification task and show that this method outperforms vanilla fine-tuning of the PMLM. It also outperforms or is on par with basic ITFT that relies on an HRL sentiment classification dataset.

KEYWORDS
deep learning, natural languages, natural language processing

1 | INTRODUCTION

Pre-trained multilingual language models (PMLMs) have shown very promising results for text classification, even for low-resource language (LRL) settings [1]. However, there is an imbalance of language representation in these PMLMs: LRLs are severely under-represented in these models (representation is determined by the amount of monolingual data per language used in model pre-training) [2].
Consequently, when fine-tuned with datasets of the same size, results for languages that are well-represented in the models are superior to those of languages with lower representation [3]. The amount of task-specific data used in fine-tuning the PMLMs for downstream tasks is also a deciding factor [3, 4]. However, for LRLs, creating labelled datasets is a challenge. Hence, when using PMLMs for LRL text classification, further improvements should be explored.

Intermediate Task Fine-tuning (ITFT) is a promising technique to improve the performance of PMLMs on downstream tasks, such as sentiment analysis, under resource-poor conditions. In ITFT, the PMLM is first fine-tuned with a dataset from a different language or a different task. Then this model is further fine-tuned with the target task of the considered language [5–7]. We term this basic ITFT. In contrast to this basic ITFT technique, which makes use of a sentiment-annotated dataset from a high-resource language (HRL), we propose two intermediate tasks (TransIT and AuxIT) created from a lexicon belonging to an HRL:

• TransIT: We translate the terms in the HRL lexicon to the LRL using a publicly available Machine Translation system. The (possibly noisy) translations are paired with original lexicon terms to create positive and negative samples, considering their valence scores. Using these samples, we fine-tune the model on an intermediate binary classification task.
• AuxIT: We create a set of synthetic phrases (hereafter referred to as Auxiliary Phrases [APs]) using the original HRL lexicon, and prepend them to each training data sample of the target LRL dataset. This augmented dataset is used to create a binary classification task in which an AP and a data sample having the same sentiment form a positive instance, and a negative instance otherwise. This binary classification task is used as an intermediate task that aligns sentiment words of the target LRL with their counterparts in the HRL.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2024 The Authors. CAAI Transactions on Intelligence Technology published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology and Chongqing University of Technology.

In both these methods, after the intermediate fine-tuning step is performed, the model is further fine-tuned with the sentiment classification dataset of the target LRL. Both intermediate fine-tuning methods aim to provide the PMLM with an additional cross-lingual signal on the relationship between sentiment words belonging to different languages. Our proposed methods share Ke et al.'s [8] objective of including external knowledge, but we use lexicon-based fine-tuning in contrast to their pre-training.

For our methods to work, we assume that there exists a sentiment lexicon for an HRL that has valence scores for each lexicon term. This is not an unreasonable assumption: for English, there are several such lexicons [9, 10].

We experiment with three LRLs, namely Sinhala, Tamil, and Bengali, on the task of sentiment classification. We observe that our AuxIT intermediate task outperforms the vanilla fine-tuning baseline. Interestingly, this observation holds for our experiments with an English dataset as well. It also outperforms, or is on par with, basic ITFT, which makes use of an HRL sentiment classification dataset.
Then, we further experiment with sequential ITFT by combining AuxIT with an intermediate task created from a sentiment classification dataset of an HRL (note that this is the type of task used in basic ITFT, as mentioned earlier). In other words, we fine-tune the PMLM in a sequential manner with the new task we propose as well as the HRL sentiment classification task. Finally, we fine-tune this model with the sentiment classification dataset of the target LRL. This sequential ITFT model further outperforms both vanilla fine-tuning and basic ITFT for Sinhala.

We show that model performance depends more on the quality of the HRL lexicon than on the relatedness between the HRL and the target LRL. Interestingly, when creating APs with our second method, using an HRL lexicon is better than using a noisy lexicon from the same language.

2 | RELATED WORK

2.1 | Intermediate task fine-tuning

ITFT has been proposed as a technique that can potentially improve target task performance of a pre-trained language model (PLM). Phang et al. [11] can be considered the first to introduce the concept of ITFT for PLMs. They mentioned that ITFT is intended to alleviate catastrophic forgetting and improve the robustness of the PLM. However, they further noted that determining a combination of ITFT tasks and target tasks that works well can be a challenge. As Vu et al. [12] mentioned, factors such as dataset size, the similarity between the source and target tasks, and the domain are important for the effectiveness of ITFT. Pruksachatkun et al. [6] carried out an empirical study on ITFT tasks to understand the mechanisms behind ITFT for cross-task transfer. They mentioned that target tasks that involve reasoning, such as question answering, benefit more from ITFT. ITFT has been used in both token-level and sequence-level tasks, including tasks that are similar to sentiment classification.
As an example of a sequence-level task, Savini and Caragea [5] performed sarcasm detection using pre-trained BERT-based models. They used multiple intermediate tasks, such as emotion detection from general tweets and sentiment classification of movie reviews, and observed that different ITFT tasks can help the target task in different ways. An example of a token-level task is de la Rosa [13], which used ITFT with lexical borrowing detection as the target task. ITFT has also been effective in sequence-to-sequence tasks such as Neural Machine Translation [14].

2.2 | Use of external knowledge bases

Using external knowledge bases such as lexicons or knowledge graphs has also been proposed as an alternative method to improve the performance of PLMs in downstream NLP tasks. Lauscher et al. [15] proposed a method that extends BERT to perform better on GLUE benchmark [16] tasks with the help of additionally infused lexical knowledge. Peters et al.'s [17] KnowBERT model, refined with external knowledge using entity-linking modelling and multiple external knowledge sources such as Wikipedia and WordNet, improves performance on tasks such as Word Sense Disambiguation. Similarly, Liu et al. [18] injected PLMs with Wikidata knowledge triplets and showed improved performance for knowledge-intensive downstream tasks.

External linguistic knowledge has also been leveraged to improve sentiment classification results. Teng et al. [19] proposed a simple weighted-sum technique that leverages lexicons to learn context-aware features for sentiment analysis. Qian et al. [20] showed improved results for sentiment classification with LSTM models, with the help of a lexicon constructed from MPQA [21] and the SST dataset¹. Suresh and Ong [22] proposed a method of using synthesised vector embeddings to provide external knowledge to the model.
In particular, sentiment lexicons have been used as additional knowledge sources to enhance the sentiment classification capabilities of deep learning models such as CNNs [23] and RNNs [24–26]. Lexicons have also been used to improve sentiment-aware representations in simple Transformers [27]. Ke et al. [8] proposed a technique that injects word-level linguistic knowledge into language models such as BERT with the help of a label-aware pre-training task and SentiWordNet [28] to capture sentiment words. However, these works do not directly incorporate ITFT as a method to improve the target task on PLMs.

3 | METHODOLOGY

Our solution is based on ITFT and sentiment lexicons. First, we introduce two intermediate tasks, TransIT and AuxIT, created using a sentiment lexicon of an HRL. We fine-tune the PMLM with each of these tasks separately before it is fine-tuned for the sentiment classification task with LRL data. The idea behind introducing such intermediate tasks is to provide an external cross-lingual alignment signal to the model, such that its understanding of HRL words improves its understanding of the low-resource ones. Next, we implement sequential ITFT by combining the best of our newly introduced intermediate tasks with an intermediate task created from an HRL sentiment classification dataset.

¹ https://nlp.stanford.edu/sentiment/
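The overall ITFT recipe (one or more intermediate fine-tuning stages followed by target-task fine-tuning) can be summarised in a short sketch. This is an illustrative outline with stub functions of our own naming (`fine_tune`, `itft`), not the authors' released code; a model is represented here simply as the list of tasks it has been trained on.

```python
# Illustrative sketch of (sequential) ITFT. `fine_tune` is a stub standing
# in for one full fine-tuning run of a PMLM on a labelled dataset.
def fine_tune(model, task):
    # In practice: train `model` on `task` and return the updated weights.
    return model + [task]  # here, a model is just the list of tasks it has seen


def itft(pmlm, intermediate_tasks, target_task):
    """Fine-tune on each intermediate task in order, then on the target task."""
    model = pmlm
    for task in intermediate_tasks:
        model = fine_tune(model, task)
    return fine_tune(model, target_task)


# Basic ITFT: HRL sentiment data first, then the LRL dataset.
basic = itft([], ["HRL-sentiment"], "LRL-sentiment")
# Sequential ITFT: AuxIT, then the HRL sentiment task, then the LRL dataset.
sequential = itft([], ["AuxIT", "HRL-sentiment"], "LRL-sentiment")
```

The sketch only fixes the ordering of the stages; the actual training runs are standard supervised fine-tuning.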
3.1 | Baselines

We employ three baselines:

• Fine-tune the PMLM with an HRL sentiment classification dataset, and test with the LRL sentiment classification dataset (zero-shot)
• Fine-tune the PMLM with the LRL sentiment classification dataset (i.e., no ITFT)
• Basic ITFT: fine-tune the PMLM first with a sentiment classification dataset of an HRL, and then with the LRL sentiment classification dataset

3.2 | Bilingual sentiment-word phrases as intermediate task data (TransIT)

Our first method is straightforward. We use a sentiment lexicon from an HRL, where each term (i.e., a sentiment word) has a valence score. The valence score can be used as a measure of the sentiment of a word: the positiveness of a word increases as the valence value nears 1, and its negativeness increases as the valence score nears 0 [10, 29]. We create a set of phrases that contain HRL terms and their corresponding translations in the LRL. These terms are selected based on their valence scores (i.e., positive sentiment words correspond to high valence scores). Each such phrase is labelled 1, as it carries terms bearing a similar sentiment. The following example shows how a positive sample (label 1) is created:

Example (Tamil): "good Nalla[SEP]affection Pācam[EOS]"; (label = 1)².

We create another set of phrases, labelled 0, by combining original lexical terms with translated terms having a different sentiment:

Example: "good Nalla[SEP]toxic Naccu[EOS]"; (label = 0).

Here, the English terms are paired with a Tamil term of dissimilar sentiment (the transliterations and translations of the Tamil words in their original script can be found in Figure A2). We use the created phrases in a binary classification task and fine-tune the PMLM.

3.3 | Augmented LRL data as intermediate task data (AuxIT)

As in the TransIT method, in this technique auxiliary phrases (APs) are created considering the valence scores of the HRL lexicon.
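Before turning to the details of AuxIT, the TransIT construction above can be sketched in pure Python. This is a minimal illustration; the function names, the polarity threshold, and the toy valence/translation data are our own assumptions, not the paper's released code.

```python
def polarity(valence):
    """Coarse polarity from a valence score in [0, 1] (0.5 threshold assumed)."""
    return "positive" if valence >= 0.5 else "negative"


def transit_sample(term1, term2, lexicon, translate):
    """Build one TransIT phrase ('hrl1 lrl1[SEP]hrl2 lrl2[EOS]') with
    label 1 if both HRL terms share a polarity, else 0."""
    phrase = f"{term1} {translate(term1)}[SEP]{term2} {translate(term2)}[EOS]"
    label = int(polarity(lexicon[term1]) == polarity(lexicon[term2]))
    return phrase, label


# Toy lexicon and (transliterated) Tamil translations, mirroring the
# paper's example; the valence values are illustrative.
lexicon = {"good": 0.9, "affection": 0.8, "toxic": 0.1}
translate = {"good": "Nalla", "affection": "Pācam", "toxic": "Naccu"}.get

print(transit_sample("good", "affection", lexicon, translate))
# -> ('good Nalla[SEP]affection Pācam[EOS]', 1)
print(transit_sample("good", "toxic", lexicon, translate))
# -> ('good Nalla[SEP]toxic Naccu[EOS]', 0)
```

Each generated phrase thus carries four terms (two HRL words plus their two LRL translations), matching the format used in the experiments.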
However, unlike in TransIT, in this method we create phrases that can be verified to carry the intended sentiment value to the model. We do this by feeding the created phrases to a separately fine-tuned sentiment classification model and ensuring that it classifies the phrases into the expected sentiment class. We expect that these newly synthesised APs can provide an external alignment signal to the model. In other words, since an AP is guaranteed to have a specific, pre-known sentiment value, we expect the AP to give an alignment signal related to the particular sentiment class during the intermediate fine-tuning phase of the model.

To identify positive, neutral and negative words in the lexicon, we first manually define valence score intervals. This is done by manually inspecting the valence score ranges of the lexicon against the words and defining the valence score ranges for positive, neutral and negative words. We verify this manual selection by providing a set of APs in English to a model fine-tuned on the English dataset and observing that it predicts the expected sentiment classes (an example is shown in the Appendix).

After selecting sentiment words from the lexicon, they are converted into phrases by considering all permutations of the selected words. To select the best APs, we use a separate PMLM fine-tuned on a 3-class (positive, negative, neutral) sentiment classification dataset of the same HRL³. We define the set of initial APs created using the permutations of sentiment words picked from the lexicon as S_k, where k denotes the sentiment class (k ∈ {neutral, negative, positive}) and |S_k| = N, N ∈ ℤ⁺. The best AP(s) (denoted by the set s) for a particular sentiment class are selected by

s = arg max_{i ∈ S_k, 1 ≤ i ≤ N} M(i),

where M denotes the model fine-tuned with English data, and |s| ≥ 1.
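The AP generation and selection step can be sketched as follows: bucket lexicon words by valence interval, enumerate permutations of class words as candidate APs, and keep those that the fine-tuned HRL model M scores highest for the intended class. The outer interval boundaries, all names, and the stand-in scoring function below are our own illustrative choices (only the (0.4, 0.6) neutral band is stated in the paper).

```python
import itertools

# Valence intervals for the three classes. The (0.4, 0.6) neutral band is
# from the paper; the outer boundaries are assumptions for illustration.
INTERVALS = {"negative": (0.0, 0.4), "neutral": (0.4, 0.6), "positive": (0.6, 1.0)}


def words_for_class(lexicon, cls):
    """All lexicon words whose valence falls in the class interval."""
    lo, hi = INTERVALS[cls]
    return [w for w, v in lexicon.items() if lo <= v < hi]


def select_aps(lexicon, cls, score, n_words=2, n_best=2):
    """Candidate APs S_k are permutations of `n_words` class words; keep the
    `n_best` phrases maximising `score(phrase, cls)`, which stands in for
    the class logit of the fine-tuned HRL model M."""
    candidates = [" ".join(p) for p in
                  itertools.permutations(words_for_class(lexicon, cls), n_words)]
    return sorted(candidates, key=lambda ph: score(ph, cls), reverse=True)[:n_best]


# Toy usage with a deterministic stand-in scorer (prefers shorter phrases).
lex = {"calm": 0.50, "table": 0.55, "okay": 0.45, "great": 0.9}
best = select_aps(lex, "neutral", lambda ph, cls: -len(ph))
```

In the paper, `score` is the positive output logit of the intended class from the fine-tuned PMLM, rather than this toy length heuristic.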
According to the example shown in Figure 1, for the neutral sentiment class, we filter the best AP(s) by feeding them to this fine-tuned PMLM and taking the phrases that give the highest positive output logit value for the intended sentiment class. Each AP has a specific sentiment based on the words it contains. Our APs resemble a structure similar to "Universal Adversarial Triggers" [30]; however, we use sentiment words from a lexicon to create the APs, whereas Wallace et al. [30] create trigger phrases with a refined subset of the model vocabulary (with no reference to sentiment words).

These APs are then prepended to the original data samples of the target language⁴. When the AP and the target language data sample have a similar sentiment, the augmented sample is labelled as 1, and 0 otherwise. An example using Sinhala is shown in Figure 2. There, the terms in the AP carry neutral sentiment (i.e., valence scores in the (0.4, 0.6) interval), which means the AP bears a neutral sentiment.

² We use transliterated Tamil words here to avoid script-related issues and improve the paper's readability, but used the words in their respective scripts in the experiments. Translations are shown in Appendix 8.
³ We create the required fine-tuned model with our English dataset (Tweets). This could be a different model fine-tuned on the same 3 classes. This fine-tuning is a one-time task.
⁴ In initial experiments, we found that prepending provides slightly better results than appending APs to sentences.
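The prepending-and-labelling scheme for building the AuxIT intermediate-task dataset can be sketched as below. `samples` carries (text, sentiment-class) pairs from the target LRL dataset and `aps` maps each class to its selected APs; the names and the 50/50 split parameterisation are our own illustrative framing of the paper's setup.

```python
import random


def augment_with_aps(samples, aps, seed=0):
    """Prepend roughly half the samples with a same-class AP (label 1) and
    the rest with an AP of a different class (label 0), yielding the binary
    intermediate-task dataset."""
    rng = random.Random(seed)
    augmented = []
    for text, cls in samples:
        if rng.random() < 0.5:  # same-sentiment AP -> positive instance
            ap, label = rng.choice(aps[cls]), 1
        else:                   # different-sentiment AP -> negative instance
            other = rng.choice([c for c in aps if c != cls])
            ap, label = rng.choice(aps[other]), 0
        augmented.append((f"{ap} {text}", label))
    return augmented
```

The model fine-tuned on this binary task must decide whether the AP and the original sentence agree in sentiment, which is what pushes sentiment words of the two languages closer together.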
The Sinhala phrases translate as "There's more here than we know" and "This work should be given maximum punishment". The PMLM is fine-tuned with this augmented dataset. Finally, this fine-tuned model is again fine-tuned using the LRL dataset as the final training task (see Figure 2).

3.4 | Sequential ITFT

As will be presented in Section 4.3, only our AuxIT intermediate task outperformed the baselines. Therefore, we combine this intermediate task with our third baseline (fine-tuning the PMLM with a sentiment classification dataset of an HRL) in a sequential manner. In other words, we first fine-tune the PMLM with one of these tasks, and then with the other. Finally, the resulting model is further fine-tuned with the dataset of the LRL.

4 | EXPERIMENTS

4.1 | Datasets and lexicons

For English, we use the US Twitter Airline Sentiment dataset⁵ (a general-domain dataset) and a dataset from the financial domain [31]. Note that English is the HRL used in our experiments. We use a 4-class (positive, negative, neutral, conflict) Sinhala sentiment dataset [32], which consists of news comments extracted from news websites, and remove the conflict class for our experiments. For Tamil and Bengali, we use datasets released by Hande et al. [33] and Islam et al. [34], respectively (the Tamil dataset also contains code-mixed data samples).

We primarily use the VAD sentiment lexicon [10] as our HRL sentiment word lexicon. This lexicon contains valence, dominance, and arousal scores for a set of 20,000 English words and their translations into 102 other languages. We also experiment with VADER [9], which consists of 7520 sentiment words (including emojis) and their valence scores.

4.2 | Training setup

We select XLM-R-base as the PMLM. It supports all the languages considered in our experiments and has shown promising results for sentiment analysis [35, 36].
In method 1 (TransIT), we create a binary dataset with 26,000 data samples using all the sentiment words in the VAD lexicon, with 4 terms per data sample. For method 2 (AuxIT), we prepend 50% of the original training sentences with APs having the same sentiment and the rest with APs having dissimilar sentiments (to have a balanced number of data points in the two classes). We average the results across 3 randomly initialised runs and report the macro averages of the F1 scores. Hyperparameters are given in Table A1 in the Appendix.

4.3 | Comparative results for different ITFT setups

Table 1 shows the results. Our first method, TransIT, yields lower results than the baseline for Sinhala (macro-F1 69.33%) and for Tamil (61.08%). Thus, we do not report this result in the table, nor do we use TransIT in further experiments. Although the first method is trivial to implement, the phrases created may not always convey the intended sentiment value to the model: although a phrase contains specific sentiment words, it is not guaranteed to convey the intended sentiment. We verify this by feeding these phrases to an XLM-R model fine-tuned on a sentiment classification task and checking their predicted labels.

FIGURE 1 Creating APs from the lexicon. A neutral AP is considered as an example. Dotted boxes connected by blue arrows show example instances in the respective steps. Dotted arrows represent inputs from lexicons/datasets.

⁵ https://www.kaggle.com/crowdflower/twitter-airline-sentiment
We observe that the model fails to classify these phrases into their intended labels, even though they contain sentiment words belonging to the respective sentiment class. This could happen because the phrases are too short and lack the structure needed to carry useful information [37, 38].

In Table 1, a clear performance gain is visible for our second method (AuxIT) over the second baseline (vanilla fine-tuning). The highest gains are reported for the English and Sinhala datasets. The performance of AuxIT is also better than or on par with basic ITFT (the third baseline). Note that for English, which is the HRL in our experiments, we use APs from the same language, unlike for the other three languages. Also, basic ITFT does not apply to English, because English is the HRL dataset we used. Interestingly, even for English, the AuxIT method shows noticeable gains, which demonstrates the utility of lexicons when fine-tuning PMLMs for tasks in HRLs.

In sequential ITFT, following basic ITFT with AuxIT did not improve the results. However, the reverse ordering yielded improved results for Sinhala.

TABLE 1 Macro-F1 scores of experiments with different methods. The best performance for each experiment is indicated with bold numbers.

| Dataset | Baseline 1 | Baseline 2 | Baseline 3 | AuxIT | Basic ITFT → AuxIT | AuxIT → basic ITFT |
|---|---|---|---|---|---|---|
| English (Tweets) | - | 80.17 | - | **81.32** | - | - |
| English (Finance) | - | 87.92 | - | **88.77** | - | - |
| Sinhala | 62.23 | 69.61 | 70.30 | 71.19 | 69.10 | **71.45** |
| Tamil | 43.46 | 63.87 | 64.65 | **64.67** | 61.97 | 63.86 |
| Bengali | 40.20 | 42.73 | **43.31** | 43.26 | 37.59 | 39.14 |

FIGURE 2 Proposed two-stage fine-tuning method using auxiliary phrases. The neutral AP from Figure 1 is considered. An example of using an AP with an actual data sample from the Sinhala dataset is shown at the bottom. The phrase in red denotes the created AP, which is prepended to a sentence of the target LRL dataset (in green). Here, label = 1 is assigned, as both phrases carry the same sentiment value (neutral). The transliterations and translations of the sentences are shown in Figure A1; they are labelled as neutral and negative (respectively) in the original dataset.

5 | ABLATION STUDY

We carry out several ablation experiments on the Sinhala dataset to determine the effects of the following factors. We use 1 AP per class for the first three experiments, and compare their results with the baseline obtained from vanilla fine-tuning of XLM-R (see Table 2).

• Effect of the language used to create APs (using the Tweets dataset)⁶
• Valence scores of the sentiment words
• Sentiment lexicon
• Number of terms in an AP
• Number of APs

⁶ We continue with this dataset as it yielded a better gain with our method than the other dataset.

5.1 | Effect of the language used to create APs

To observe the effect of the language (specifically, language relatedness) of the lexicon used to create APs, we experiment with APs created in different languages. We create sentiment lexicons for Hindi, Tamil, and Bengali by translating the VAD English lexicon (we used Google Translate). Results are reported in experiment 1 of Table 2. Although Hindi and Bengali belong to the same language family as Sinhala, and Tamil is geographically co-located with Sinhala, the results are low compared to the English lexicon. We believe this is due to the higher representation of English in XLM-R compared to the other languages [1]. Translation errors can also occur when translating APs to other languages, and the noisiness of these translations can be another reason.
Some examples of such erroneous translations, taken from the VAD lexicon, are presented below:

• The word "flop" (valence score 0.081, negative) does not have a correct translation in Sinhala; it has only a transliteration of the word.
• The word "forge" (valence score 0.52, in the neutral region) has a Sinhala translation related to only one of its meanings, "deceptive imitation", which should carry a negative sentiment.
• The phrase "pissing me off" (valence score 0.208, negative) is associated with a Sinhala translation of the opposite (positive) sentiment, "pibidev" (the English translation of "pibidev" is "arise").
• The word "abbot" has no translation at all.

Such errors can occur when lexicon creators translate their lexicons into LRLs relying on the machine translation tools at their disposal [39, 40].

5.2 | Effect of the valence scores of lexicon words

To determine the importance of the valence score for creating APs, we created random APs using randomly picked words from each valence score interval, and observed a drop in results (experiment 2 in Table 2). This could be because randomly picked terms may be weak sentiment words (i.e., their valence scores are not strong enough). This observation justifies the AP selection method we introduce in AuxIT.

5.3 | Effect of the sentiment lexicon

Experiment 3 in Table 2 shows the results for the two different lexicons we used (VADER and VAD). VAD performs better, possibly because it contains a more diverse set of words, especially in the neutral class.

5.4 | Effect of the length of APs and number of APs

We also conducted experiments to determine the optimal number of lexicon terms in an AP, and the number of APs used per sentiment class.
Figure 3 shows a drop in results as the number of words per AP is increased (orange line), possibly due to over-fitting of the model during fine-tuning. We do not experiment beyond 8 words per AP, as it takes an excessive amount of time to run through all the permutations. We also found that 2 different APs work best for our approach, and that more APs can hinder performance, as seen in Figure 3 (blue line).

5.5 | Impact of ITFT on cross-lingual alignment of sentiment words

With the proposed two intermediate fine-tuning methods, we expect to provide an additional cross-lingual signal to the PMLM via the APs. To verify whether our method has been successful in this, we analyse the latent-space representations (word embeddings) of individual sentiment words from Sinhala and English, with and without our AuxIT intermediate task. We manually pick English and Sinhala positive/negative sentiment words. The visualisation (process details are in the Appendix) in Figure 4 shows that Sinhala and English words with similar sentiments are grouped much closer after the intermediate fine-tuning step. This is particularly true for negative words⁷.

⁷ Due to font issues, we show transliterated words in the graph, but use the words in their actual script for experiments. We show their translations and transliterations in Figure A3.

TABLE 2 Results (macro-F1) for experiments with varying attributes of the APs, on the Sinhala sentiment dataset. The best performance obtained for each experiment is indicated with bold numbers.

| Experiment no. | Changed AP attribute | F1 |
|---|---|---|
| - | Baseline 2: vanilla fine-tuning | 69.61 |
| 1 | APs in different languages: English | **70.56** |
| | APs in different languages: Sinhala | 69.66 |
| | APs in different languages: Tamil | 70.28 |
| | APs in different languages: Bengali | 69.16 |
| | APs in different languages: Hindi | 69.73 |
| 2 | Randomly selected APs | 67.66 |
| 3 | APs created with different lexicons: VAD sentiment lexicon | **70.56** |
| | APs created with different lexicons: VADER | 69.99 |

6 | CONCLUSION

We proposed two cross-lingual intermediate task fine-tuning methods on PMLMs for sentiment analysis of LRLs, based on a sentiment lexicon of an HRL. Of these, fine-tuning on augmented data created from the HRL lexicon (AuxIT) yielded noticeable improvements over vanilla fine-tuning. AuxIT also outperformed or was on par with basic ITFT. We showed that this gain is due to the newly introduced intermediate fine-tuning technique (AuxIT) providing an additional cross-lingual signal that helps the PMLM learn the similarity between sentiment words belonging to different languages. We further introduced sequential ITFT, which fine-tunes the PMLM with AuxIT and basic ITFT in a sequential manner.

Our solution was tested only for languages included in XLM-R. In the future, we will consider other models, and languages not included in them. Another avenue for future work is to refine our method for more fine-grained sentiment classification tasks; our current experiments considered only coarse-grained sentiment classification.

ACKNOWLEDGEMENT
Vinura Dhananjaya was funded by a Senate Research Committee grant (SRC/LT/2020/11) of the University of Moratuwa. Open access publishing facilitated by Massey University, as part of the Wiley - Massey University agreement via the Council of Australian University Librarians.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
FIGURE 3 Macro-F1 score with a varying number of APs per class and number of words per AP.

FIGURE 4 Word embeddings visualisation for positive (blue) and negative (red) words in Sinhala and English. The circle markers show embeddings from a vanilla fine-tuned model (on the Sinhala dataset) and the triangle markers show embeddings from our approach.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no datasets were generated during the current study.

ORCID
Surangika Ranathunga https://orcid.org/0000-0003-0701-0204

REFERENCES
1. Hu, J., et al.: XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In: International Conference on Machine Learning, pp. 4411–4421. PMLR (2020)
2. Ranathunga, S., De Silva, N.: Some languages are more equal than others: probing deeper into the linguistic disparity in the NLP world. In: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, pp. 823–848 (2022)
3. Wu, S., Dredze, M.: Are all languages created equal in multilingual BERT? In: Proceedings of the 5th Workshop on Representation Learning for NLP, pp. 120–130. Association for Computational Linguistics, Online (2020). https://aclanthology.org/2020.repl4nlp-1.16
4.
Doddapaneni, S., et al.: A primer on pretrained multilingual language models. CoRR, abs/2107.00676 (2021). https://arxiv.org/abs/2107.00676
5. Savini, E., Caragea, C.: Intermediate‐task transfer learning with BERT for sarcasm detection. Mathematics 10(5), 844 (2022). https://doi.org/10.3390/math10050844
6. Pruksachatkun, Y., et al.: Intermediate‐task transfer learning with pretrained language models: when and why does it work? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5231–5247. Association for Computational Linguistics, Online (2020). https://aclanthology.org/2020.acl-main.467
7. Chang, T.Y., Lu, C.J.: Rethinking why intermediate‐task fine‐tuning works. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 706–713. Association for Computational Linguistics, Punta Cana, Dominican Republic (2021). https://aclanthology.org/2021.findings-emnlp.61
8. Ke, P., et al.: SentiLARE: sentiment‐aware language representation learning with linguistic knowledge. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6975–6988. Association for Computational Linguistics, Online (2020). https://aclanthology.org/2020.emnlp-main.567
9. Hutto, C., Gilbert, E.: VADER: a parsimonious rule‐based model for sentiment analysis of social media text. Proceedings of the International AAAI Conference on Web and Social Media 8(1), 216–225 (2014). https://doi.org/10.1609/icwsm.v8i1.14550
10. Mohammad, S.: Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 174–184. Association for Computational Linguistics, Melbourne (2018). https://aclanthology.org/P18-1017
11.
Phang, J., Févry, T., Bowman, S.R.: Sentence encoders on STILTs: supplementary training on intermediate labeled‐data tasks. arXiv preprint arXiv:1811.01088 (2018)
12. Vu, T., et al.: Exploring and predicting transferability across NLP tasks. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7882–7926. Association for Computational Linguistics, Online (2020). https://aclanthology.org/2020.emnlp-main.635
13. De la Rosa, J.: ADoBo 2021: the futility of STILTs for the classification of lexical borrowings in Spanish. In: IberLEF@SEPLN, pp. 947–955 (2021)
14. Nayak, S., et al.: Leveraging auxiliary domain parallel data in intermediate task fine‐tuning for low‐resource translation. arXiv preprint arXiv:2306.01382 (2023)
15. Lauscher, A., et al.: Informing unsupervised pretraining with external linguistic knowledge. CoRR, abs/1909.02339 (2019). http://arxiv.org/abs/1909.02339
16. Wang, A., et al.: GLUE: a multi‐task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Association for Computational Linguistics, Brussels (2018). https://aclanthology.org/W18-5446
17. Peters, M.E., et al.: Knowledge enhanced contextual word representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP‐IJCNLP), pp. 43–54. Association for Computational Linguistics, Hong Kong (2019). https://aclanthology.org/D19-1005
18. Liu, L., et al.: Knowledge based multilingual language model. CoRR, abs/2111.10962 (2021). https://arxiv.org/abs/2111.10962
19. Teng, Z., Vo, D.T., Zhang, Y.: Context‐sensitive lexicon features for neural sentiment analysis. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1629–1638.
Association for Computational Linguistics, Austin (2016). https://aclanthology.org/D16-1169
20. Qian, Q., et al.: Linguistically regularized LSTM for sentiment classification. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1679–1689. Association for Computational Linguistics, Vancouver (2017). https://aclanthology.org/P17-1154
21. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase‐level sentiment analysis. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 347–354. Association for Computational Linguistics, Vancouver, British Columbia, Canada (2005). https://aclanthology.org/H05-1044
22. Suresh, V., Ong, D.C.: Using knowledge‐embedded attention to augment pre‐trained language models for fine‐grained emotion recognition. In: 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 1–8. IEEE (2021)
23. Shin, B., Lee, T., Choi, J.D.: Lexicon integrated CNN models with attention for sentiment analysis. In: Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 149–158. Association for Computational Linguistics, Copenhagen (2017). https://aclanthology.org/W17-5220
24. Kumar, A., Kawahara, D., Kurohashi, S.: Knowledge‐enriched two‐layered attention network for sentiment analysis. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 253–258. Association for Computational Linguistics, New Orleans (2018). https://aclanthology.org/N18-2041
25. Margatina, K., Baziotis, C., Potamianos, A.: Attention‐based conditioning methods for external knowledge integration.
In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3944–3951. Association for Computational Linguistics, Florence, Italy (2019). https://aclanthology.org/P19-1385
26. Ma, Y., Peng, H., Cambria, E.: Targeted aspect‐based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. Proc. AAAI Conf. Artif. Intell. 32(1) (2018). https://doi.org/10.1609/aaai.v32i1.12048
27. Zhong, P., Wang, D., Miao, C.: Knowledge‐enriched transformer for emotion detection in textual conversations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP‐IJCNLP), pp. 165–176. Association for Computational Linguistics, Hong Kong (2019). https://aclanthology.org/D19-1016
28. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA), Valletta, Malta (2010). http://www.lrec-conf.org/proceedings/lrec2010/pdf/769_Paper.pdf
29. Mehrabian, A.: Pleasure‐arousal‐dominance: a general framework for describing and measuring individual differences in temperament. Curr. Psychol. 14(4), 261–292 (1996). https://doi.org/10.1007/bf02686918
30. Wallace, E., et al.: Universal adversarial triggers for attacking and analyzing NLP. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP‐IJCNLP), pp. 2153–2162. Association for Computational Linguistics, Hong Kong (2019). https://aclanthology.org/D19-1221
31. Malo, P., et al.: Good debt or bad debt: detecting semantic orientations in economic texts.
Journal of the Association for Information Science and Technology 65(4), 782–796 (2014). https://doi.org/10.1002/asi.23062
32. Senevirathne, L., et al.: Sentiment analysis for Sinhala language using deep learning techniques. arXiv preprint arXiv:2011.07280 (2020)
33. Hande, A., et al.: Benchmarking multi‐task learning for sentiment analysis and offensive language identification in under‐resourced Dravidian languages. CoRR, abs/2108.03867 (2021). https://arxiv.org/abs/2108.03867
34. Islam, K.I., Islam, M.S., Amin, M.R.: Sentiment analysis in Bengali via transfer learning using multi‐lingual BERT. In: 2020 23rd International Conference on Computer and Information Technology (ICCIT), pp. 1–5. IEEE (2020)
35. Rathnayake, H., et al.: Adapter‐based fine‐tuning of pre‐trained multilingual language models for code‐mixed and code‐switched text
classification. Knowl. Inf. Syst. 64(7), 1937–1966 (2022). https://doi.org/10.1007/s10115-022-01698-1
36. Dhananjaya, V., et al.: BERTifying Sinhala - a comprehensive analysis of pre‐trained language models for Sinhala text classification. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 7377–7385 (2022)
37.
Zhan, J., Dahal, B.: Using deep learning for short text understanding. Journal of Big Data 10(1), 4 (2017). https://doi.org/10.1186/s40537-017-0095-2
38. Hussein, D.M.E.D.M.: A survey on sentiment analysis challenges. Journal of King Saud University - Engineering Sciences 30(4), 330–338 (2018). https://doi.org/10.1016/j.jksues.2016.04.002
39. Wan, Y., et al.: Challenges of neural machine translation for short texts. Comput. Ling. 48(2), 321–342 (2022). https://doi.org/10.1162/coli_a_00435
40. Bapna, A., et al.: Building machine translation systems for the next thousand languages. arXiv preprint arXiv:2205.03983 (2022)
41. Van der Maaten, L., Hinton, G.: Visualizing data using t‐SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008). http://jmlr.org/papers/v9/vandermaaten08a.html
42. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2018)

How to cite this article: Dhananjaya, V., Ranathunga, S., Jayasena, S.: Lexicon‐based fine‐tuning of multilingual language models for low‐resource language sentiment analysis. CAAI Trans. Intell. Technol. 1–10 (2024). https://doi.org/10.1049/cit2.12333

APPENDIX

Visualisation of word embeddings
We look at how the XLM‐R word embeddings of several sentiment words change when our method is used. We choose positive and negative words in Sinhala and English that are also present in our training data and the VAD lexicon. We perform dimensionality reduction using Truncated Singular Value Decomposition (Truncated SVD) followed by t‐SNE [41] to obtain 2D representations of the original XLM‐R embeddings for the words. For the dimensionality reduction, we set a fixed random state (we use Scikit‐Learn's implementation8), try different perplexity values for t‐SNE in the [1, 50] interval, and choose the visualisation producing the lowest Kullback‐Leibler (KL) divergence after 1000 iterations (perplexity = 10).
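The reduction described above could be implemented roughly as follows. This is a minimal sketch, not the authors' code: the function name `reduce_embeddings`, the intermediate SVD dimensionality of 50, and the skipping of perplexities that are too large for the sample count are our illustrative assumptions; Scikit-Learn's t-SNE runs 1000 optimisation iterations by default, matching the setting stated above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

def reduce_embeddings(embeddings, random_state=0, perplexities=range(1, 51)):
    """Project word embeddings to 2D: Truncated SVD, then t-SNE,
    keeping the projection with the lowest KL divergence."""
    # Linear pre-reduction; the target of 50 dimensions is an assumption.
    svd = TruncatedSVD(n_components=50, random_state=random_state)
    reduced = svd.fit_transform(embeddings)

    best_proj, best_kl = None, np.inf
    for perp in perplexities:
        if perp >= len(embeddings):  # t-SNE requires perplexity < n_samples
            continue
        tsne = TSNE(n_components=2, perplexity=perp, random_state=random_state)
        proj = tsne.fit_transform(reduced)
        if tsne.kl_divergence_ < best_kl:  # keep the lowest-KL visualisation
            best_proj, best_kl = proj, tsne.kl_divergence_
    return best_proj, best_kl
```

In the paper's setting, `embeddings` would be the matrix of representations for the selected sentiment words; any (n_samples × hidden_size) array works here.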
We used the [CLS] token's representation as the word vector. We select 16 words in both English and Sinhala; the transliterations/translations of the selected Sinhala words are shown in Figure A3.

A.1 | Hyperparameters
Table A1 shows the hyperparameters and dataset sizes used for each baseline experiment in the different languages. Epochs separated by a comma are for the intermediate fine‐tuning task and the final 3‐class classification task, respectively. We use fewer epochs for Tamil than for the other languages, as we observe that higher epoch counts tend to overfit on the Tamil dataset with our method. We use AdamW [42] for all fine‐tuning tasks.

FIGURE A1 Transliterations/translations of the example sentiment sentences shown in Figure 2.
FIGURE A2 Translations of the Tamil words used for the example in Section 3.2 (for Method‐1).
FIGURE A3 Transliterations/translations for Sinhala words used in Figure 4.

TABLE A1 Parameters used for each dataset for obtaining baseline results.

Dataset             Train/Test   Epochs  Learning rate  Batch size
English (Tweets)    13176/1464   4, 3    5e-6           16
English (Finance)   2037/1315    2, 3    5e-6           8
Sinhala             11833/1314   4, 3    5e-6           8
Tamil               15694/1743   3, 2    5e-6           8
Bengali             14853/3000   5, 4    5e-6           8

8 https://scikit-learn.org.

An example of selecting an AP based on the output logit values
We choose the 3 most positive words from the lexicon (e.g. VADER): magnificently, ilu, aml, and create
permutations from them. The permutations are then fed into a fine‐tuned model, and the best is selected by the highest output logit value for the positive sentiment class prediction. In this example, we expect negative, neutral, and positive predictions at indexes 1, 2 and 3, respectively, in the model output array. Hence, here we choose the fourth permutation in the list.

1. aml magnificently ilu: [−2.0061314, −1.4377168, 3.1962798]
2. aml ilu magnificently: [−1.8522748, −1.5239806, 3.1405883]
3. magnificently aml ilu: [−2.0096264, −1.4296048, 3.1805775]
4. magnificently ilu aml: [−1.9999465, −1.4706941, 3.1985717]
5. ilu aml magnificently: [−1.6413125, −1.6105448, 3.0310764]
6.
ilu magnificently aml: [−1.9787084, −1.4326444, 3.1722727]

APs (top two in each sentiment class) created using the VAD lexicon
• Positive—very positive magnificent love happy, joyful greatness happiest happier
• Neutral—aardvark bluff bookseller token, mushroom rigging bowler sifting
• Negative—shit suffering died toxic, decayed pain murderer chaos

APs (top two in each sentiment class) created using the VADER lexicon
• Positive—magnificently ilu aml, euphoria ecstasy hearts sweetheart
• Neutral—borer sceptics %)
• Negative—slavery raping rapist, murder rape kill terrorist

Computational resources
For all our experiments, we used the XLM‐R‐base model, which contains 270M parameters, as the multilingual model. We utilised a single shared GPU (Nvidia Quadro RTX 6000, 24 GB). On average, one randomly initialised run of each experiment consumes ~0.4 h.
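The permutation-selection procedure illustrated in the example above can be sketched as follows. The names `select_best_ap` and `score_fn` are hypothetical, introduced only for this sketch; the [negative, neutral, positive] ordering of the output logits follows the example, and a real `score_fn` would wrap a fine-tuned classifier's forward pass.

```python
from itertools import permutations

def select_best_ap(words, score_fn, positive_index=2):
    """Choose the ordering of `words` whose positive-class logit is highest.

    `score_fn` maps a phrase string to a sequence of class logits ordered
    [negative, neutral, positive] (an assumption of this sketch).
    """
    candidates = [" ".join(p) for p in permutations(words)]
    # Score every permutation; keep the one the model rates most positive.
    return max(candidates, key=lambda c: score_fn(c)[positive_index])
```

With the logit values listed in the example, this returns "magnificently ilu aml", matching the fourth permutation chosen above.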