Behavior Research Methods (2023) 55:3009–3025
https://doi.org/10.3758/s13428-022-01955-9

Frustration and ennui among Amazon MTurk workers

Craig Fowler¹ · Jian Jiao² · Margaret Pitts²

¹ School of Communication, Journalism, & Marketing, Massey University, Private Bag 102904, North Shore, Auckland 0745, New Zealand
² Department of Communication, University of Arizona, Communication Building #222, 1103 E. University Blvd, Tucson, AZ 85721-0025, USA

Craig Fowler (corresponding author): c.fowler@massey.ac.nz
Jian Jiao: jianj@email.arizona.edu
Margaret Pitts: mjpitts@arizona.edu

Accepted: 10 August 2022 / Published online: 26 August 2022
© The Author(s) 2022

Abstract

Academics are increasingly turning to crowdsourcing platforms to recruit research participants. Their endeavors have benefited from a proliferation of studies attesting to the quality of crowdsourced data or offering guidance on managing specific challenges associated with doing crowdsourced research. Thus far, however, relatively little is known about what it is like to be a participant in crowdsourced research. Our analysis of almost 1400 free-text responses provides insight into the frustrations encountered by workers on one widely used crowdsourcing site: Amazon’s MTurk. Some of these frustrations stem from inherent limitations of the MTurk platform and cannot easily be addressed by researchers. Many others, however, concern factors that are directly controllable by researchers and that may also be relevant for researchers using other crowdsourcing platforms such as Prolific or CrowdFlower. Based on participants’ accounts of their experiences as crowdsource workers, we offer recommendations researchers might consider as they seek to design online studies that demonstrate consideration for respondents and respect for their time, effort, and dignity.

Keywords: Crowdsourcing · Ethics · Digital methods · Internet · Job satisfaction · Online research · Participants

Not long ago, Buhrmester et al. (2018) remarked on how rapidly Amazon Mechanical Turk (MTurk) went from being virtually unheard of to being ubiquitous across the social sciences. In 2009, for instance, just a handful of papers using MTurk were published in social science journals with impact factors of 2.5 or greater. This number had increased to almost 50 by 2011 and to just shy of 550 papers by 2015 (Chandler & Shapiro, 2016). Political scientists were quick to show their enthusiasm for the possibilities afforded by MTurk (Christenson & Glick, 2013), as were scholars in psychology. As recently as 2012, for instance, fewer than 10% of papers published in seven top psychology journals reported studies using MTurk. In five of the same seven journals, however, the proportion of published studies using MTurk was at least 24% by 2017, and in two specialist social psychology journals, the figure exceeded 40% (Stewart et al., 2017; Zhou & Fishbach, 2016). A comparable proportion of studies (43%) published in the Journal of Consumer Research between June 2015 and April 2016 drew on MTurk data (Goodman & Paolacci, 2017), suggesting that some business-focused disciplines have also embraced the use of crowdsourced data. The communication discipline was one of the “later entrants to the crowdsourcing arena” (Sheehan, 2018, pp. 140–141), but many communication and media scholars have now joined colleagues in other social science fields in using this approach to data collection. For instance, Stansberry (2020) reviewed articles published in Public Relations Review during 2018 and found that 42% of articles employing online surveys or experiments drew on source data from MTurk or a similar platform.
Although the use of crowdsourcing platforms has been embraced by many academics, concern has also been voiced that they constitute “digital sweatshops” (Pittman & Sheehan, 2016, p. 260) and are a “poorly paid hell” (Semuels, 2018) in which participants are vulnerable to exploitation. Nonetheless, in a large-scale survey of Turkers—that is, individuals who work on MTurk—38% of respondents felt “extremely positive” about the platform. Indeed, some declared in open-ended comments that they “love[d] working on MTurk,” and considered it a “Godsend” (Mehrotra, 2020). Clearly, crowdsourced research platforms can offer benefits to both researchers and participants. However, increased use and reliance on such platforms compels us to question how crowdsourcing platforms can be used effectively and ethically.

In what follows, we examine this question through a critical interpretive lens. We begin by discussing reasons for which researchers might be both wary of and enthusiastic about crowdsourced research, before highlighting factors that researchers who use crowdsourcing platforms should consider. Using principles of qualitative inquiry to preserve and amplify the voices of experienced Turkers, we then analyze a large corpus of open-ended responses in which Turkers describe how requesters can design online studies in such a way as to minimize the potential for frustration. In so doing, we hope that this will also enhance the degree to which researchers are able to operate in ways that recognize the essential humanity, agency, and contributions of crowdsource workers in general, and Turkers in particular (Gleibs, 2017). Ultimately, by considering Turkers’ descriptions of their frustrations in light of findings from the (web-)survey methods literature, we arrive at a series of suggestions for practice that may both improve the lot of crowdsource workers and improve the quality of data obtained by researchers.

Reasons to be cautious about crowdsourced research

Researchers may be concerned that participants recruited via platforms such as MTurk may not be entirely “on the level.” Findings from some studies can appear to justify such concerns. For instance, by asking participants to report numbers obtained via rolls of virtual dice (which were linked to financial incentives) and comparing these responses to theoretically expected distributions, Suri et al. (2011) found evidence that Turkers misreported the results of their rolls even when the monetary incentive for doing so was small. There is also evidence that a nontrivial proportion of Turkers misrepresent demographic and personality characteristics to gain access to studies from which they would otherwise be disqualified (e.g., Chandler & Paolacci, 2017; Siegel & Navarro, 2019), which means that researchers may legitimately question whether Turkers can be counted on to be who they say they are. In one multi-study report, for example, between 24% and 83% of respondents in a given study were identified as probable imposters (Wessling et al., 2017).
The risk of researchers mistakenly believing they are actually studying a narrowly defined population of interest is thought to be amplified when recruiting from “a population that has limited representation on MTurk” (Siegel & Navarro, 2019, p. 246). In a recent study, for instance, Burnette et al. (2022) sought to recruit a sample of over 2000 transgender persons via MTurk to validate and establish norms for widely used measures of eating disorder symptoms. However, despite incorporating several tactics to increase the chances of gathering valid data, the research team was forced to abandon the planned project. Of 2413 respondents who consented to participate, passed attention checks, and did not complete the study implausibly quickly, 1060 provided inconsistent data with respect to their gender identity. Burnette et al. believed this cast doubt on whether they were, in fact, transgender, which was the primary criterion for inclusion in the study.

Despite the challenges that can beset collecting data via MTurk (and other crowdsourcing platforms), there are steps researchers can take to reduce the possibility that their sample comprises and is compromised by imposters (Wessling et al., 2017¹). Similarly, although some researchers have experienced challenges to data integrity as a result of bots (i.e., “malicious software” that auto-generates “indiscriminate responses to survey questions”; Roman et al., 2022, p. 1) and the use of server farms, strategies exist to detect fraudulent responses (Chmielewski & Kucker, 2019).

A different sort of challenge to researchers has become more pressing as the use of crowdsourcing becomes widespread. In the early days of crowdsourced research, there was only “a handful of respondents who participate habitually” (Berinsky et al., 2012, p. 366). Now, however, participant non-naïveté can be problematic. One of the “selling points” of the MTurk platform is the sheer number of potential participants to whom researchers have access. Amazon’s own data tracking suggests that there are as many as 750,000 unique monthly visitors to MTurk (Hitlin, 2016). Nonetheless, the number of participants to whom any given research group has access is likely to be orders of magnitude smaller, with Stewart et al.’s (2015) analysis suggesting that “a typical laboratory can access about 7300 workers” (p. 479), many of whom have already completed thousands of tasks on MTurk (Chandler et al., 2014; Harms & DeSimone, 2015). Such findings raise concern that as Turkers complete more and more studies, they become familiar with widely used stimuli and measures, thereby distorting research findings and—perhaps—increasing their potential to find Turking frustrating as they become more experienced survey-takers.

¹ Wessling et al. (2017), for instance, recommend that, prior to launching their substantive study, researchers create a brief pre-screening questionnaire. In a way that does not “give the game away,” researchers can mix questions that allow them to gauge whether eligibility criteria are met with filler items, and then make the main study available only to participants who met these eligibility conditions, either by setting qualifications within MTurk, or by creating participant “whitelists” via third-party platforms such as CloudResearch.
Another reason researchers may have reservations about carrying out crowdsourced research is that they may question whether Turkers—as a population—are attentive and diligent participants. Lending credence to such concerns, some researchers have found Turkers to be less attentive to experimental materials than participants recruited in more traditional ways (Goodman et al., 2012), to provide such rapid responses as to raise questions regarding the trustworthiness of the data (Harms & DeSimone, 2015), or to multitask while completing human intelligence tasks (i.e., “HITs”; Chandler et al., 2014). Several studies provide useful counterpoints, however. For instance, Necka et al. (2016) found that although Turkers may engage in undesirable respondent behaviors, they do not do so more frequently than participants recruited by other means. Furthermore, Hauser and Schwarz (2016) conducted a series of three studies from which they determined Turkers to be “more attentive to instructions than…college students” (p. 400; emphasis added).

Reasons to be enthusiastic about crowdsourced research

Early meta-research sought to compare the demographic composition of MTurk samples to that of other convenience samples and to “gold standard” representative samples. Findings suggested that while MTurk samples did “not perfectly match the demographic and attitudinal characteristics of the US population,” neither did they “present a wildly distorted view” (Berinsky et al., 2012, p. 361). Moreover, several studies indicated that MTurk samples were considerably more demographically diverse than the typical American college sample that is often recruited for social scientific research (e.g., Berinsky et al., 2012; Buhrmester et al., 2011; Casler et al., 2013).

Researchers have also determined that the quality of data obtained via MTurk compares well to that gathered via professional marketing research companies or from college students with respect to measures of engagement, indices of test-retest reliability and internal consistency, and measures of criterion validity (Kees et al., 2017; Peer et al., 2014; Shapiro et al., 2013). Moreover, experimental effects and decision-making biases that have been well documented in laboratory or campus settings have been replicated with crowdsourced samples (Berinsky et al., 2012; Goodman et al., 2012; Paolacci et al., 2010; Peer et al., 2017), and researchers who have compared MTurk and college student samples have deemed the results “almost indistinguishable” (Casler et al., 2013, p. 2156). Attesting to the high degree of correspondence between results obtained via MTurk and those obtained in other ways, Stewart et al. (2017) observe that “in the many-labs project, the pattern of effects and null effects for 13 social psychology and decision-making coefficients corresponded perfectly between concurrently collected student and MTurk samples” (p. 741).

Finally, researchers’ confidence in the quality of data that can be collected via MTurk may be bolstered by the fact that the platform features a built-in reputation system: requesters can allow only workers with a sufficiently high reputation to participate in their studies. Particularly in the relatively early days of MTurk research, findings indicated that academics could safely rely on participants’ MTurk reputation score as a means of assuring data quality (Peer et al., 2014).
However, it is important to stress that this reputation score is only useful to the extent that requesters actually take the time to provide a meaningful rating of the work provided by workers. As Ahler et al. (2021) point out, although it may be relatively easy to provide a positive or negative rating for a participant who performs a task that can be completed in an objectively “right” or “wrong” way, it is more difficult for researchers to rate a participant who provides an opinion or responds to measures that are designed to capture subjective perceptions. Moreover, a researcher who only uses MTurk quite infrequently “has few incentives to sink resources into monitoring quality; instead her investment is typically capped at the payout rate” (p. 3). As such, Ahler et al. caution that a worker’s HIT approval rate is very likely to be an upwardly biased indicator of reputation.

In sum, we believe that, in general, research speaks favorably of the quality of data that can be obtained from respondents on the MTurk platform. Admittedly, samples drawn from MTurk cannot be treated as representative, but they are particularly well-suited for “conducting internally valid experiments” (Berinsky et al., 2012, p. 361).

The experience of Turking

Researchers have asked and answered numerous questions regarding how best to make use of MTurk and other crowdsourcing platforms. Unfortunately, the perspective of those constituting the crowd has largely been missing, although some researchers have published auto-ethnographic reports of their own experiences as temporary Turkers (Schmidt, 2015), or considered the ethical treatment of crowdsourced participants (e.g., Chandler & Shapiro, 2016; Gleibs, 2017; Paolacci et al., 2010).

One area of concern regarding the ethical treatment of Turkers (and members of similar platforms) is compensation. Brown (2015, para 7) observed that “researchers are going to have to face up to the fact that by using MTurk, they are typically exploiting sub-minimum wage labour.” His remark was not hyperbolic, for an analysis of 3.8 million HITs showed that just 4% of a sample of 2676 Turkers earned more than $7.25 per hour (Hara et al., 2017). This is particularly concerning because many individuals participate in crowdsourced work because there is no other work available to them (Marder & Fritz, 2015; Mehrotra, 2020; Semuels, 2018), and a quarter of Turkers canvassed by Pew Research reported that they derived “all” or “most” of their income from MTurk (Hitlin, 2016). Highlighting the sensitive issues surrounding compensation, Burnette et al. (2022) report receiving emails from Turkers that speak to the pressure they felt to be approved for payment, and describe one email in particular in which a worker “pleaded for their HIT to be approved because they needed the money to feed their child” (p. 265).

There have been calls to ensure that Turkers receive at least the local minimum wage (e.g., Harms & DeSimone, 2015), and perhaps to follow the lead of a rival platform, Prolific, which will not allow researchers to pay below a certain level, sanctions requesters who underpay relative to the time a task takes, and—if requests to retroactively increase payment to acceptable levels go unheeded—suspends researcher accounts.
Gleibs (2017) went a step further, suggesting that journals “could require authors to pay minimum-wage scale (from the respective participants’ country of residence) incomes to crowdsourced participants” (p. 1338) and recommended that ethics boards should weigh “standards of fair pay and employment protection” when determining whether a study meets standards for ethical conduct.

Researchers must also consider several other issues related to compensation. For instance, although it is normative to compensate participants who withdraw from a study (either in whole, or proportional to the time they have invested in a study), crowdsourcing platforms often make it difficult to pay participants who do not complete a project (Burnette et al., 2022; Gleibs, 2017; Paolacci et al., 2010). Indeed, non-payment for work completed in good faith has been identified as a perennial problem facing Turkers, given that the platform allows requesters to reject work and refuse payment while still retaining use of the data obtained (Chandler & Shapiro, 2016; Paolacci et al., 2010).

Other ethical principles also need to be considered. Unless researchers accurately report how much time participants will need to spend on a HIT, and the nature of the work involved, it is difficult to claim that respondents are able to provide informed consent (Paolacci et al., 2010). Researchers must also recognize that the nature of online research makes it difficult to know whether their studies have caused harm to participants (Chandler & Shapiro, 2016). Mehrotra (2020, paras. 32–33), for example, heard from workers who had felt “emotionally traumatized by an academic survey,” who experienced “intensely negative feelings,” and who were “brought to tears” by being asked to recall painful experiences. However, unless these workers contacted researchers to relay these experiences, the researchers would have no way of knowing that participating in their studies had caused such distress.

Despite the legitimate concerns that have been expressed over the working conditions experienced by Turkers, a recent large-scale study offers an encouraging counterbalance. From a survey of over 4000 Turkers in the United States, Moss et al. (2020) found that Turkers were not more financially vulnerable than were members of the wider US population. Moreover, they generally did not find the experience of Turking to be stressful, nor did they often report finding their working conditions (or requesters) to be abusive. In fact, they found the benefits offered by Turking sufficiently appealing that they “would not trade the flexibility of MTurk for less than $25 per hour.”

It should also be emphasized that Turkers themselves seem to take their work seriously. Marder and Fritz (2015, para 74) spoke with 100 “Super Turkers,” and learned that many “reported a degree of pride in their work, despite the tedium and lousy pay. And notably, despite the lack of oversight, they weren’t even tempted to game the system.” Marder and Fritz go on to point out that although some researchers might worry that their study could be compromised if participants share information about it, the official rules for a popular online forum for Turkers (“Turker Nation”) specify that there must be “no disclosure or discussion of attention memory checks. No discussion of survey content, period. That can affect the results” (Marder & Fritz, 2015, para 76).

Research purpose

Our own research has increasingly drawn on crowdsourced samples. The impetus for this study comes from our desire to use crowdsourcing platforms such as MTurk in ways that recognize and respect the contributions (and indeed, basic humanity) of the people who complete our studies.
We therefore sought to gain direct insight from Turkers about their frustrations with—and recommendations for—crowdsourced research participation. The following research question guided our inquiry.

RQ: What can we learn from the frustrations Turkers report about their participation in crowdsourced research?

Method

The first author was preparing to collect data for a study with specific eligibility criteria. To minimize the risk of character misrepresentation, he followed Wessling et al.’s (2017) “two-survey process” by posting a paid pre-screening HIT to MTurk (via CloudResearch) with a view to inviting eligible persons to be involved in the substantive part of the study. As part of this pre-screening HIT, the researcher included an open-ended question inquiring about participants’ recommendations for reducing frustrations for Turkers. Their responses provided rich insight into the experiences and frustrations of Turking and served as data for the current study.

We framed our study within a critical interpretivist paradigm. By applying a critical paradigm, we were able to focus our attention on issues of power, values, exploitation, equity, and fairness (Lindlof & Taylor, 2011) experienced by Turkers. By also applying an interpretivist frame, we were able to retain a sharp focus on the social realities, knowledge, and lived experiences of the participants (Lindlof & Taylor, 2011). This approach allowed us to preserve and amplify the voices and experiences of Turkers, to provide empirical evidence of their frustrations with this type of work, and to use their responses to guide how we design our own research studies so that they demonstrate both scientific integrity and respect for respondents.

Participants

The pre-screen was accessed 1509 times. Removing cases where nothing was entered for one or more of the open-ended questions and eliminating nonsense or bot-like responses to the open-ended questions reduced the number of useable responses to 1369. Of these, 50.6% (n = 692) of respondents were female, 49% (n = 671) were male, and .4% (n = 5) identified as being outside the gender binary. Participants’ mean age was 37.73 years (SD = 11.69; range = 18–82).

Participants were asked to identify which ethnic group(s) they belonged to; given that they were able to check multiple ethnicities, percentages sum to more than 100%. The majority of respondents (1093) identified as White (79.8%). An additional 120 (8.8%) respondents identified as Black or African American; 123 (9.0%) as Asian or Asian American; 83 (6.1%) as Latino/Hispanic; 20 (1.5%) as American Indian or Alaskan Native; 9 (0.7%) as Native Hawaiian or Pacific Islander; and 5 (0.4%) as Middle Eastern. Nine people (0.7%) marked the “other” box.

Procedures

As Wessling et al. (2017) point out, “it is important that the screening question[s] be masked by other questions” (p. 221). So that participants could not easily intuit that this was, in fact, a pre-screener for a subsequent study, the pre-screener questionnaire was titled “Improving my surveys for Turkers.” Participants were asked how many academic surveys they thought they had completed on MTurk during the previous week (none; 1–5; 6–10; 11–14; 15–20; 21–25; 26–30²; and over 30). The modal response was “over 30” (n = 641; 46.9%). We asked participants who checked the “over 30” response to estimate the number of surveys they had completed during the preceding week. The mean number was 105.88 (SD = 162; median = 69.50). Excluding the estimates of 16 respondents whose responses exceeded 486 (i.e., 3 SD above the mean) reduced the mean to 85.93 (SD = 60.14; median = 65.00)³.
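To make the exclusion rule just described concrete, the sketch below shows one way such a screen can be computed. It is a minimal illustration under assumptions rather than the script used for this study: the values in the hypothetical `estimates` array stand in for self-reported survey counts, and the cutoff is simply the full-sample mean plus three standard deviations.

```python
# Minimal sketch (not the authors' code) of excluding estimates more than
# 3 SD above the sample mean before recomputing descriptive statistics.
# The values in `estimates` are hypothetical.
import numpy as np

estimates = np.array([20, 35, 40, 50, 60, 65, 70, 80, 90,
                      100, 110, 120, 150, 180, 3000], dtype=float)

cutoff = estimates.mean() + 3 * estimates.std(ddof=1)  # 3 SD above the full-sample mean
kept = estimates[estimates <= cutoff]                  # drop implausibly large estimates

print(f"Excluded {estimates.size - kept.size} estimate(s) above {cutoff:.0f}")
print(f"M = {kept.mean():.2f}, SD = {kept.std(ddof=1):.2f}, Mdn = {np.median(kept):.2f}")
```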
Participants were also asked two open-ended questions regarding their experiences of Turking. Although there was no character limit imposed on the length of responses, respondents were told that they only needed to write a sentence for each⁴. Participants were paid 50 cents for what was projected to be a three-minute study⁵. In this manuscript, we focus our analysis on participants’ written responses to the question “How could academic surveys on MTurk be designed so that frustration for Turkers is reduced/minimized?”

² These response options reflect an error in survey construction: 11–14 should have been 11–15, and 15–20 should have been 16–20.
³ The second author recently collected data from two samples via Prolific. Respondents from the first sample had completed an average of 480 studies (ever). Respondents in sample 2 had completed an average of 620 studies.
⁴ Participants answered further questions that were used as covariates in the subsequent focal study, which were introduced as “items that I hope will help me think about how best to design a study I hope to do in the future.”
⁵ In the interests of full disclosure, three participants who reviewed this study noted that although the survey was well intentioned, they felt underpaid because it took them twice as long as estimated to complete the HIT.

Coding and data analysis

We used an iterative coding procedure. To begin, each investigator read the corpus of data to get a broad, holistic sense of participants’ experiences. Then, two members of the research team conducted the primary, inductive coding and analysis, reserving the principal investigator (PI) for the coding confirmation check. In the first phase of coding, we reviewed each of the 1369 responses to determine what would be the unit of analysis. Because single responses could contain several frustrations and recommendations, we established the unit of analysis to be each discrete frustration or recommendation mentioned within each response rather than the entire response. Specifically, each item (concept) in a string of frustrations or recommendations was coded individually.

The first cycle of analysis began with open coding. Open coding is a process of assigning a unique label to each chunk of data that addresses the research purpose. Rather than reducing the data through assignment to predetermined categories or codes, open coding is an inductive approach that expands the corpus of data to allow analysts to identify nuanced findings and examine the unique parts that make up the whole. An advanced graduate student trained in qualitative analysis open coded the initial 1369 responses, applying in vivo codes as often as possible. In vivo codes use the participant’s own words or phrases as the name of the code. This allows the analysts to foreground the participant’s meaning during coding while mitigating the compulsion to impose their interpretation too early. The first cycle of open coding concluded after every participant response was assigned at least one code. This process yielded 463 unique open codes with 2310 references (units of data assigned to codes).

We then engaged in iterative rounds of second-cycle coding.
Second-cycle coding is a process of organizing similar open codes, and later, clusters of open codes, into larger, increasingly abstract thematic categories—each time assigning a descriptive label reflective of the content to the emergent categories. During second-cycle coding, we first clustered open codes into 31 categories and then further collapsed them into 22 categories that we then organized into six overarching themes (see Table 1). At the end of the second cycle, we engaged the PI to confirm the coding reliability. The PI was given a table with a conceptual definition and prototypical exemplar for each category and asked to code 10% of the total units of analysis (n = 232) into those categories. As an established and conservative approach to evaluating intercoder coding reliability, we used Krippendorff’s α (Krippendorff, 2012). Krippendorff (2012) suggests that α above .667 is tentative and α at or above .8 is preferred. Intercoder reliability was excellent in our study, as evidenced by α = .916 for the 22 subcategories, and α = .955 for the overarching themes.
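For readers who want to see how an index of this kind can be computed, the sketch below uses the open-source krippendorff package for Python. It is an illustration under assumptions rather than our analysis code: the two coders’ assignments are hypothetical, and the 22 subcategories are represented simply as integer labels.

```python
# Minimal sketch (not the authors' analysis code) of computing Krippendorff's
# alpha for two coders assigning units to nominal categories.
# Requires: pip install krippendorff
import numpy as np
import krippendorff

# Rows = coders; columns = units of analysis (discrete frustrations or
# recommendations). Values are integer labels standing in for the 22
# subcategories (hypothetical data); np.nan would mark unrated units.
coder_1 = [3, 3, 7, 1, 12, 5, 5, 20, 9, 1, 14, 14]
coder_2 = [3, 4, 7, 1, 12, 5, 5, 20, 9, 1, 14, 16]

alpha = krippendorff.alpha(
    reliability_data=np.array([coder_1, coder_2], dtype=float),
    level_of_measurement="nominal",  # subcategory labels are unordered
)
print(f"Krippendorff's alpha = {alpha:.3f}")
```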
Findings

For MTurk workers, frustration is commonplace. Indeed, there were only 53 instances (2.29%) of participants listing no frustrations. Many participants wrote about multiple frustrations, with most listing at least 2 (M = 1.7).

Table 1. (Sub)themes of Turking frustrations

(Sub)theme | Number of references | Percentage
Difficulties with survey design and accessibility | | 37.84%
  Structural and visual issues | 384 | 16.62%
  Did not have a progress or completion bar | 117 | 5.06%
  Should be shorter | 108 | 4.68%
  Should allow more time to complete | 85 | 3.68%
  Should be proofread | 55 | 2.38%
  Survey accessibility | 74 | 3.20%
  Should make it more interesting and engaging | 51 | 2.21%
Frustrations with question design | | 24.16%
  Repetition of questions | 218 | 9.44%
  Question quality | 132 | 5.71%
  Providing written responses | 111 | 4.81%
  Store answers to common questions in profile | 57 | 2.47%
  Some questions should not be asked | 40 | 1.73%
Fair pay for fair work | | 13.29%
  Did not pay well | 277 | 11.99%
  Did not pay for qualification questions or failed attention checks | 25 | 1.08%
  Should indicate how long payment will take and pay quicker | 5 | 0.22%
Frustrations due to qualification checks, attention checks, and confirmation codes | | 11.65%
  Troubles with or about confirmation codes | 111 | 4.81%
  Troubles with or about qualification checks | 83 | 3.59%
  Annoying attention checks | 60 | 2.60%
  Should be more careful when rejecting work | 15 | 0.65%
Desire for clear, accurate, and convenient communication between workers and researchers | | 10.78%
  Clarity and accuracy of the HIT | 232 | 10.04%
  Should enable more convenient communication between requestors and workers | 17 | 0.74%
No frustrations | | 2.29%
  No frustrations | 53 | 2.29%

Note. Unindented rows give the total percentage for each overarching category; indented rows are its subthemes.

Design features and platform limitations cause Turker disengagement

Our first theme, difficulties with survey design and accessibility, represents the most frequently mentioned source of frustration—design issues (n = 874; 37.84% of responses). The chief source of frustrations in this category concerned the structural and visual/aesthetic design of surveys (n = 384; 16.62% of all coded responses).

Frustration with the structural and visual/aesthetic design of surveys was manifested in various ways. Participants frequently reported feeling that they had been consigned to “bubble hell.” That is, they felt “daunted” or put off by being presented with a page (or pages) of endless “bubbles” (i.e., radio buttons presented in grid form) to click. Participants also wanted both “bigger” and “fewer” bubbles on a page, and often complained about surveys not having go-back buttons. In short, they lamented the apparent prioritization of function over design, with one respondent explaining: “I like to take surveys that appear to be streamlined and well designed. Sometimes the appearance is cluttered. I’ve returned hits that have too much information or tasks on a single page. It makes me lose focus.” Importantly, however, respondents noted that the visual presentation of material also impeded functionality. For instance, Turkers found it frustrating when requesters failed to use “constant labels” for scales or did not use a “consistent question (or answer) format.” Participants also emphasized the importance of making sure the wording for a response (e.g., “highly satisfied”) is always visible in addition to the number (“5”). One specific recommendation was that researchers “highlight choice bubbles with their appropriate columns so that the workers do not have to constantly scroll back to the top of the page to see what each column and bubble represents.”

Another important frustration raised by participants concerned the length of studies, which participants often believed were too long and could (or should) be shorter (n = 108; 4.68%). This issue was compounded for many participants both by the lack of a progress bar to indicate how much of the survey remained (n = 117; 5.06%), and the fact that the time allotted to complete the task prior to being “timed out” was insufficient (n = 85; 3.68%).

Broad issues of survey design were also identified by participants who noted their frustration with typographic errors resulting from a lack of proofreading (n = 55; 2.38%). “I think that proofreading/beta-testing surveys before they are released would help minimize frustrations,” wrote one participant. Another participant commented, rather pointedly, that “There’s little more galling than running into attention checks in a study that the researchers appear to have barely paid attention to themselves.” It is notable, we think, that although researchers may question the attentiveness and carefulness of participants, numerous participants eloquently articulated similar frustrations regarding requesters.

Participants also recommended researchers design surveys with a view to making tasks more interesting, engaging, and even entertaining (n = 51; 2.21%). This was not simply a matter of aesthetic preference: Turkers felt that this would improve their overall experience during a survey, thereby facilitating their attentive completion of tasks. In the words of one respondent:

    A bit of humor helps. I recently took one where some ridiculous questions/answers were mixed in, and it brightened my mood. Also try to mix things up so that it’s not just page after page of questions using the same scale. Video and music is always great to see since it’s more interesting than just reading a lot of text.

Finally, a relatively small proportion of frustrations related to issues of survey accessibility deriving from platform limitations of MTurk (n = 74; 3.20%).
These largely related to issues of compatibility with and/or requirements for particular devices, software, or browsers (e.g., “tell me it requires Firefox ahead of time so I’m not returning it” [i.e., the HIT]).

Frustrations with questions demotivate Turkers

The theme, frustrations with question design, captures the second most frequently mentioned source of frustrations (n = 558; 24.16% of all coded responses). Participants expressed frustration with the perceived repetitiveness of questions, unhelpful response options and formats, and overall question quality or content.

Repetition of questions within and across surveys topped this list of frustrations (n = 218; 9.44% of all coded responses). Although some participants recognized that researchers may need to measure complex constructs in multiple ways or with multiple items, “asking the same questions in tons of different ways” was a persistent complaint. One respondent observed, for example, that “Spending a huge amount of time answering the same questions, only worded slightly different, is a real headache,” noting that while they “underst[ood] the reason behind it,” they found that “when I have to answer the same general question 10 times, my attention level drops drastically.” Likewise, another Turker explained that “Too many questions are…pretty much the same…but differing in tiny ways that are not relevant. Or maybe they are, I don’t know, but they tend to get boring.” Notably, some participants not only felt wearied by repetition, but took personal offence at it. One individual wanted researchers:

    To not repeat questions in different ways, but instead phrase questions meaningfully and trust that people will respond honestly. I feel a bit demotivated when I see the same question asked more than once, like the integrity of my data is in question.

The quality of questions in general (and the quality of response options in particular) was another source of frustration for Turkers (n = 132; 5.71%). Some complained about options that were not exhaustive (“should provide more options of answers”) or that did not include a “not applicable” option.

A number of participants reported that they found it frustrating to be asked to complete open-ended responses or longer writing tasks (n = 111; 4.81%). Several respondents simply said that they did not like any writing tasks, commenting that they should be avoided or used sparingly. In one Turker’s words, “Requesters have to realize…that Turkers loath [sic] any survey that has an extended amount of writing in it. So either design your surveys so you aren’t asking open-ended questions, or incentivize Turkers to answer them by increasing pay.”

A fairly common frustration expressed by participants concerned the need to complete “boilerplate questions.” More specifically, participants found it annoying to have to provide basic demographic measures or complete common scales every time they accepted a HIT, with some noting they would prefer “answers to some questions…[to be]…stored in the profile and only be taken once” so that “repeated demographic or personality questions [could be] auto filled.” There were 57 references to this issue (2.47% of all responses).

Finally, participants felt that “some questions should not be asked” (n = 40; 1.73%).
They disapproved of having to supply “personal” or “unnecessary private information” about age, race, finances, and so forth, in some cases perceiving this information to be outside of the scope of their task.

Turkers desire fair pay for fair work

The third theme, fair pay for fair work (n = 307; 13.29% of all coded responses), relates to the amount, fairness, and timeliness of payment. This category was dominated by the simple frustration that Turking did not pay well (n = 277; 11.99% of coded responses). As one Turker wrote, “Better pay is always needed. A fair wage would compensate for any type of task.” Another noted that although, for them, poor compensation levels merely affected their “pocket money,” others were affected in more profound ways. In their words, “$0.10/m is often recommended as good pay but $6/h is an awful pay rate. I turk part time for pocket money but that rate is below minimum wage and some people do rely on that money completely.”

Interestingly, some respondents observed that poor pay affected researchers negatively by forcing crowdsource workers to accept multiple HITs in order to cobble together a reasonable hourly wage, which necessitated rushing to complete tasks. Explicitly noting the association between pay and data quality, one Turker stated:

    The pay is what makes me interested in surveys. Upping the pay is almost always appreciated—and will almost always ensure better, careful results. People on Mturk tend to try to make hits pay out to a decent hourly. If the hit is underpaid, they will probably rush through…whereas if it’s a generous payment they’re going to want to give the requester good data. It’s just how things are here.

It was important to participants that compensation be commensurate with the nature and quantity of work required by a HIT. For example, they suggested that longer surveys and surveys with writing prompts or other engagement activities should be better compensated. Reflecting this, one person stated that “I enjoy academic surveys as I assume that my opinion truly matters. It’s only the long ones with very little compensation where I feel…taken advantage of.”

A number of participants highlighted the injustice of not being compensated for completing (often lengthy) qualification and screening questions only to then be deemed ineligible for a study. Similarly, participants who devoted significant time to completing a survey, but who failed to appropriately respond to an attention check question or activity, were frustrated by the lack of payment for their engagement. From their perspective, they spent time on the task and should be compensated for the work they did complete. Twenty-five of the coded responses (1.08%) reflected frustration concerning not being paid for qualification questions or being denied payment on the basis of failing attention checks.

A smaller proportion of responses in this theme (n = 5; 0.22%) also voiced frustrations related to a lack of timely payment. Turkers were leery about accepting HITs that did not indicate how long payment would take and reported feeling frustrated when payment took longer than seven days. There was a strong desire for uncomplicated, transparent, and “quick” compensation.

Confirmation codes, attention checks, and (dis)qualification processes disrupt and distress Turkers

The fourth theme, frustrations due to qualification checks, attention checks, and confirmation codes, represents 11.65% of all coded responses (n = 269).
Having to spend time searching for confirmation codes was the most oft-occurring complaint in this theme (n = 111; 4.81% of all coded responses). Simply put, participants wanted requesters to display confirmation codes clearly and prominently, and to ensure that they are actually provided.

Turkers also found it frustrating when there were “too many” qualification questions, or when such questions appeared late in a task (n = 83; 3.59%). As one participant explained, “If there are qualifications, list them ahead of time. I sometimes take surveys then am rejected afterwards because I didn’t meet screening criteria when there were none listed.”

Annoyance with attention checks⁶ featured almost as prominently (n = 60; 2.60%) and elicited more detailed feedback. Turkers felt that attention checks disrupted the research process and distracted workers from their task, as seen in the examples below:

    The frequency of attention checks could be reduced, especially when they are placed in a long psychological survey where it breaks up the flow of reading the questions.

    Don’t over complicate attention checks, whenever I see a complicated attention check all I think about is the next one coming. It takes the focus off of the survey and adds unnecessary anxiety.

It is important to note that some participants found attention checks to be not only distracting or anxiety-provoking, but patronizing. One participant, for example, wrote that “attention checks could be replaced with smarter and less condescending strategies to ensure data quality (e.g., consistency check, 2-step recruitment).” At the very least, respondents noted that “new and exciting attention checks could help.”

Finally, frustration was expressed over the process of disqualification (n = 15; 0.65%). Participants pointed out that having their work rejected may lower their reputation rating. They implored requestors to be careful when rejecting work, and to provide reasonable explanations for doing so. As one Turker put it:

    Please don’t reject the work unless absolutely necessary. We want to do a good job and if a requester rejects for no reason and won’t talk to us, we worry about keeping up our good percentages. It’s best to reject the survey at the beginning if an attention check is missed. Just stop us from getting bad marks instead of letting us do the whole survey then rejecting it.

⁶ Kung et al. (2018) provide a helpful clarification of the differences between attention checks (ACs) and instructional manipulation checks (IMCs), considering the latter to be a specific type of the former. However, although researchers may differentiate between these terms, the participants in our study referred exclusively to attention checks.

Turkers want clear, accurate, and convenient communication with requestors

The fifth theme, desire for clear, accurate, and convenient communication between workers and researchers, centers on frustrations related to communication between Turkers and requestors (n = 249; 10.78% of all coded frustrations). Failure to communicate accurate time estimates was the most prominent category within this theme, although the broader point that emerged (n = 232; 10.04% of all coded frustrations) was simply that Turkers wanted clear, accurate, and convenient communication with researchers from the point at which they were recruited to the point at which they were paid.

Turkers rely on the information provided about a HIT to make a decision about whether to accept it. As such, they wanted “accurate and honest” communication up front about how long a HIT will take, what they are expected to do (e.g., providing written responses or accessing external URLs), and whether there were technical requirements for completing a HIT, such as using a certain browser or downloading additional software.
For some Turkers, poor or inaccurate communication was stressful as well as frustrating.

    The biggest annoyances are misleading time estimates and surveys not mentioning there is writing involved in the description. I think requesters should have a few people take their survey first to get an accurate time estimate and then base the pay on that estimate, because when they say “This survey takes 10–15 minutes” and it winds up taking only 2 minutes or so, I get paranoid thinking they might reject me for working too quickly.

A relatively small number of the frustrations expressed by Turkers (n = 17; 0.74%) concerned problems relating to the lack of convenient ways of communicating with requesters. For example, Turkers reported difficulty communicating with researchers especially when they “have a problem,” with one Turker recommending that it needed to be “easier for requesters and workers to communicate about hits.” Others expressed frustration with the lack of accessible contact information and reported that researchers were unresponsive to their inquiries. For example, one Turker wanted “a better way to report if a code was not found; half the time [I] never heard back from requesters.”

Discussion and recommendations

We begin our discussion with a summative comment from one participant that encompasses many of the frustrations represented in the thematic findings.

    Oh, thank you for asking! First, make certain you’ve allowed enough time. This is one top peeve for turkers. Second, just as important—don’t lie or carelessly throw out a time estimate for your study, that winds up being incorrect. Requesters will be crucified in their reviews. Also, make sure the directions are crystal clear. Ambiguity is incredibly annoying! Also, very important, do not forget the code! Also, make sure every part of your hit works before you let it loose on the community. We also hate it when questions are super repetitive or when there is what we call a “bubble hell”. Bubble hell does not promote interest in your study. It promotes boredom and a desire to leave the study. I think that’s all I can think of.

Even before the outbreak of COVID-19, crowdsourcing platforms had become the primary means of data collection for some researchers. They also provided a supplementary (sometimes primary) source of income—as well as opportunities to contribute to scientific endeavors—for participants. Now that once-abstract terms such as “lockdown” and “social distancing” have come to define many individuals’ experiences during 2020 and 2021, it is more important than ever to reconsider the symbiotic relationship that exists between researchers and participants as they navigate vastly changed social, academic, and economic environments. By examining and extrapolating from the frustrations voiced by MTurk workers, we identify several factors researchers could (and, we believe, should) take into account when designing studies to be conducted via crowdsourcing platforms.
Having had the opportunity to read almost 1400 Turkers’ comments on what they find frustrating about their work—and for many, it is work, which has attendant implications for the degree to which researchers should strive to be good employers—we urge researchers to reflect on how their online research practices can better establish a climate of trust and demonstrate respect for participants’ time, voice, and labor.

Before turning to the specific recommendations, we wish to highlight a key issue that underlies many of the frustrations expressed by Turkers: Regardless of how individual researchers perceive themselves as treating participants from MTurk and similar platforms, participants’ discourse frequently revealed their perception that researchers (as a collective) do not value their time, intelligence, knowledge, or capabilities. To some degree, unfortunately, these assumptions may be justified. Consider, for example, that an early study (Buhrmester et al., 2011) explored whether or not participants could be recruited to complete a 30-minute survey when offered just two cents to do so. In fairness, this was only one of several aims, and these authors’ analyses indicated that—in the early days of MTurk—participants were intrinsically motivated (e.g., for enjoyment) rather than extrinsically motivated. Nonetheless, even if Buhrmester et al.’s intent was to highlight that “workers are not driven primarily by financial incentives” (p. 4), we suspect that in many of the eleven-thousand-and-counting manuscripts that have cited this article, this was not the obvious message. Rather than interpreting Buhrmester et al.’s findings as suggesting that researchers might consider how the design of their studies can capitalize on participants’ high levels of intrinsic motivation, we believe that what many have taken from this report is that “workers are willing to complete simple tasks for virtually no compensation” (p. 4).

Recommendation 1. Recognize that Turkers are sensitive to the substance of a HIT

About a fifth of frustrations concerned what Turkers were asked to do during HITs. Turkers were frustrated by questions that required written responses (4.81%) or were poorly formulated (5.71%), and irked by attention checks (2.6%). They felt requesters asked inappropriate questions (1.73%) or ones whose answers should be available from a centralized repository (2.47%), and found HITs to be uninteresting (2.21%).

Open-ended questions

We recommend researchers use open-ended questions sparingly and signal their use in HIT descriptions. This is especially important for Turkers using mobile devices and for Turkers who accept a HIT not expecting to have to write anything, only to discover that writing is required after they have invested time in the HIT. As one participant put it: “Announcing that there will be an open-ended question is nice, but not always needed. However, going through 5–15 minutes of a survey then seeing a couple of essay questions with a large word count or several writing prompts isn’t fair. It feels like my information is being stolen as I will return many of these hits.” Other Turkers did not find open-ended questions inherently frustrating but resented how much time and effort they required given the rate of pay.
Commenting on HITs where requesters ask for an essay “as vivid and detailed as possible” but only offer a “30 cent payoff,” one Turker noted that “I used to do these in the early days before I knew better. Now I just return them.”

Open-ended questions can put a high cognitive load on participants (Zuell et al., 2015) and adversely affect completion rates (Liu & Wronski, 2018). Indeed, Crawford et al. (2001) found that almost a third of respondents who quit a survey did so when shown the first of a set of open-ended questions. The risk of nonresponse can, however, be partly mitigated by clarifying why an open-ended question is being asked, stressing the value of a participant’s response, and avoiding the use of dauntingly large text boxes (Müller et al., 2014; Zuell et al., 2015).

Attention checks

Only 2.6% of responses directly concerned attention checks (ACs). However, the language used to describe them was strong (e.g., bullshit, condescending, deceptive, disrespectful, gotcha, malicious, sneaky, trick, unfair), which suggests that using ACs risks creating what one Turker described as an “adversarial” relationship between researchers and respondents. Downs et al. (2010) argued that many ACs violate Gricean norms of communication, “requiring careful attention to normally predictable information…set[ting] a tone of distrust for the remainder of the task” (p. 2400). ACs certainly generated “implicatures” among some of our respondents, who found them demeaning and demoralizing, and attributed their use either to researchers’ laziness or lack of consideration.

Our participants did not call for researchers to stop using ACs entirely. Indeed, some noted that they understand their necessity (“I understand and appreciate the need for attention checks. I agree with attention checks”; “You need to maintain the quality of your data, so attention checks…are necessary”). They did, however, voice multiple, specific frustrations with ACs. For instance, it annoyed Turkers when researchers included too many ACs or let them continue a study after failing an AC only to later reject their work. Turkers were also frustrated when researchers put ACs in consent forms or demographic pages that “people have seen 1000 times,” don’t ensure ACs work properly (“select option X…no option X to select”), test recall of distal information rather than attentiveness, or “fake out” participants. As one Turker explained:

    There will be a passage of text, and in the middle it will say, actually disregard that, please type xyz in the answer box…Most people would stop reading here and type xyz because they believe they’ve found an attention check…However I’ve seen some requesters add contradicting attention checks later on the same page, i.e. ‘disregard the passage and the former instructions and actually, ~actually~ for real this time type abc’.

Academics often infer data quality or study engagement from participants’ responses to ACs. However, including ACs may introduce new threats to validity (Hauser & Schwarz, 2015) and weed out only the “most egregious” of participants (Downs et al., 2010, p. 2400). We believe that—by implying disdain and distrust—researchers’ unthinking use of ACs can induce ill will among participants. We recommend researchers consider carefully whether and how to use ACs instead of taking for granted that they will enhance data quality.

Overly familiar content

Turkers find it frustrating to complete the same survey items time and again. Therefore, it behooves researchers to balance their need to use validated measures with participants’ need for novelty.
Doing so may alleviate participant frustration while perhaps also mitigating the threat to validity posed by participant non-naïveté (Chandler et al., 2014). Turkers also found it frustrating to have to provide the same demographic information in HIT after HIT, and some stated that they would rather complete a demographic profile that is automatically appended to submissions. Currently, Prolific lets researchers download certain demographic information about participants (e.g., sex, age), but this is not possible on MTurk. However, with researcher permission, CloudResearch adds questions to HITs to monitor the consistency in participants’ reported demographic characteristics over time. With appropriate privacy safeguards, Turkers may appreciate CloudResearch (or Amazon themselves) making “on-record” demographic information available to researchers so that they do not have to enter it so often.

Asking questions that shouldn’t be asked

Although MTurk prohibits requesters from asking for email addresses, phone numbers, social media handles, et cetera (https://www.mturk.com/acceptable-use-policy), some requesters do ask for personal data. And, perhaps because they fear losing access to work or have been induced (or coerced) into doing so by a high rate of pay or a bonus, some workers acquiesce to these requests. A respondent in Sannon and Cosley’s (2018) study, for instance, explained their reluctant decision to disclose personal information, saying they were “homeless at the time,” and, because they “really needed the money…[they] went and did it anyways” (p. 3).

Researchers who require participants to provide personal information are (on some crowdsourcing platforms) violating the terms of service. Just as importantly, they are violating participants’ right to privacy. And—if payment is perceived as being contingent on acquiescing to such information requests—they are likely violating ethical principles of non-coercion.

Recommendation 2. Don’t underestimate the importance of stylistic elements of a HIT

A large proportion (19%) of frustrations related not to the substantive nature of a HIT but to its presentation. Turkers often voiced discontent with seemingly banal structural and visual issues (16.62%) such as font size, having too few or too many questions per page, and being confronted with “endless pages” of bubbles, as well as with poor proofreading (2.38%). Good aesthetic design should not be an afterthought: It can improve participants’ willingness to complete a study (Biffignandi & Bethlehem, 2021) and facilitate their processing of questions (Casey & Poropat, 2014). Poor aesthetic design, on the other hand, can trigger “negative visceral responses and, thus, emotional reactions” from participants that reduce data quality (Mahon-Haft & Dillman, 2010, p. 43). One participant in our study noted, for example, that having “too many words on one page…makes me feel very overwhelmed.” Arguably, overwhelmed people may be relatively unlikely to provide high-quality data.

Grids and matrices

“Bubble hell” describes studies that feature lots of scale items in a grid or matrix, with responses being entered by selecting a radio button.
Such formats may be especially onerous for participants using smartphones (especially if they must scroll for the whole matrix to be visible) (Biffignandi & Bethlehem, 2021; Dillman et al., 2014), and research suggests that lengthy bubble-based surveys may have higher rates of nonresponse and straight-lining (i.e., answering with a response set) (Liu & Cernat, 2018; Müller et al., 2014). Unfortunately, respondents in our study also reported frustration with alternatives to matrices (such as drop-down lists and sliders). This reflects findings that drop-down lists are more difficult and time-consuming for respondents to use than are radio buttons and can “result in more accidental selections” (Müller et al., 2014). It may not be possible (or desirable) for researchers to avoid using bubble questions. However, ensuring the meaning of response options stays visible and breaking up walls of text with white space may reduce participants’ sense that they have been consigned to bubble hell.

Pagination

Respondents rarely stated how many items per page they considered too many or too few, but even if they had, we doubt there would have been agreement on an exact figure. Our view is that it is probably better to err on the side of having too few items per page, because including high numbers of items on a single screen can increase nonresponse rates (Toepoel et al., 2009), whereas spreading items over multiple pages (and including clear section headers) can make it easier for participants to cognitively process survey questions (Müller et al., 2014).

Recommendation 3. Reward participants’ time, help them manage it, and make good use of it

Turkers voiced frustration about several related issues: compensation (hourly pay rate, 11.99%; unpaid screeners, 1.08%), obstacles to being paid (missing confirmation codes, 4.81%; getting rejected, 0.65%), excessive survey length (4.68%) and question repetition (9.44%), and difficulty in knowing whether a HIT is worth their time (inaccurate estimates of the time required in HIT descriptions; absence of progress bars, 5.06%). Collectively, such frustrations constituted at least 37.71% of those raised.7

7 This is a conservative estimate, as it does not include any of the 10% of frustrations relating to the clarity and accuracy of HIT descriptions: We did not micro-code these frustrations for specific areas of opacity or inaccuracy, but a significant portion of them addressed inaccurate estimates of how long a HIT would take to complete.

Remuneration, withheld pay, and rejected work

Crowdsourcing has changed the relationship between researcher and participant into one wherein “the requester is a client and the participant a contractor” (Gleibs, 2017, p. 1337). Turkers see completing HITs as a job and requesters as employers—albeit ones who do not always pay or treat them fairly. Whereas Turkers were once thought to be motivated mainly by nonfinancial factors (Buhrmester et al., 2011), “monetary compensation is now…the primary reason” for Turking (Litman et al., 2015, p. 519). This shift may explain why, despite early findings that the quality of data provided by Turkers was largely unaffected by rates of pay (Buhrmester et al., 2011), recent findings suggest that compensation does affect data quality (Robertson & Yoon, 2019). One very simple recommendation, therefore, is that researchers offer fair pay that meets community norms and appropriate local/regional standards.

When Brawley and Pury (2016) asked Turkers to describe a time when they were dissatisfied with how a requester treated them, rejection of work was the dominant theme: As their reputation scores grant entrance to well-paying HITs, workers are protective of their scores and resent it when they are lowered as a result of unjustly being given “black marks” (i.e., rejections).
In our study, fewer than 2% of respondents reported frustration with requesters not taking sufficient care when rejecting work or not paying them for work attempted. Still, we consider these to be important issues, especially in light of Brawley and Pury’s findings and the ethical implications of not paying for partial completions or rejecting participants’ work.

The ethics of rejecting work from Turkers and workers on other platforms is murky. In offline studies conducted in university classrooms and labs, participants are rarely rejected for poor performance or (presumed) inattentiveness (Gleibs, 2017). Further, invoking the right to withdraw without penalty is relatively straightforward in such studies: If a participant is uncomfortable, they can explain this to the researcher, leave the lab, and still expect to be compensated. Despite the wording often included in consent forms, this is less likely in the world of crowdsourced research. Complicating questions of whether and when it should be permissible for researchers to reject work8 is the fact that rejection is the only means of exercising quality control on MTurk. Researchers often allow only workers with a sufficiently high reputation score to participate in their studies. As such, they rely on their colleagues rejecting “poor” work so they, in turn, can be confident in the quality of the data they collect (Peer et al., 2014).

8 We are not suggesting here that researchers be obligated to accept and pay for work that shows every sign of having been performed by a bot or that clearly represents deception on the part of a respondent (e.g., as evidenced by repeated submissions or inconsistent reporting of demographic characteristics).

Two reasons for having work rejected irked our participants. The first was when work was rejected after failing an attention check, especially if they had already invested time in the HIT or were allowed to complete the HIT only to later be rejected. The second was when they were rejected for completing tasks too quickly. One participant wanted researchers to realize that “someone who does this all day every day is much faster at completing…and isn’t simply not paying attention.” Another noted that Turkers are “going to be faster than the graduate students you had pilot testing your survey” because “it’s their job,” and that they shouldn’t be “penalized” for this. Such sentiments echo Robertson and Yoon’s (2019) conclusion that “the amount of time…an MTurk participant spends on a task may be a poor proxy for effort” (p. 1656).

Progress bars

That more than 5% of frustrations concerned the failure to include a progress bar reflects that progress bars are appreciated and desired by respondents (Heerwegh & Loosveldt, 2006; Müller et al., 2014). To the extent that participants who may be on the verge of exiting a study due to survey fatigue may be encouraged to stay the course if the end of the study “draws visibly nearer with every question answered” (Heerwegh & Loosveldt, 2006, p. 194), progress bars can also benefit researchers. However, progress bars can also reduce study completion rates by de-motivating and discouraging participants. This is particularly the case if they suggest that more of a survey remains than is actually the case (Biffignandi & Bethlehem, 2021; Crawford et al., 2001), a problem that may be exacerbated in surveys that are long or use skip logic (Müller et al., 2014).
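To see why skip logic can make a built-in progress bar misleading, consider the short sketch below. It is purely illustrative: the survey structure, item counts, and function names are hypothetical, and the calculation is generic rather than tied to any survey platform's actual progress-bar logic. A bar that divides items answered by every item in the instrument will understate the progress of respondents who have been legitimately routed past a section, making the end of the survey appear further away than it is.

```python
# Toy illustration (hypothetical survey and numbers): why a progress bar that
# counts every item in the instrument misleads respondents under skip logic.

sections = {            # section name -> number of items it contains
    "screener": 5,
    "module_a": 20,     # shown only to respondents routed here by the screener
    "module_b": 20,     # shown only to the other respondents
    "demographics": 10,
}

def naive_progress(items_answered: int) -> float:
    """Progress measured against every item in the instrument."""
    return items_answered / sum(sections.values())

def path_progress(items_answered: int, path: list[str]) -> float:
    """Progress measured against only the items on this respondent's route."""
    return items_answered / sum(sections[name] for name in path)

# A respondent routed through module_a has a 35-item path, not a 55-item one.
answered = 25  # screener (5) plus module_a (20) completed
print(f"naive bar:      {naive_progress(answered):.0%}")
print(f"path-aware bar: {path_progress(answered, ['screener', 'module_a', 'demographics']):.0%}")
```

In this toy example, a respondent who has answered 25 items appears roughly 45% complete under the naive calculation but is actually about 71% of the way along their own route, which is exactly the sort of discrepancy that the explicit textual markers recommended below can help offset.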
Liu and Wronski (2018) recently found that completion rates were highest when no progress bar was displayed. However, although the differences in their study were significant (as might be expected in a sample of > 25,000), they were not large. The completion rate was 87.5% when there was no progress bar and 86.8% when a progress bar was included at the top of a page: If 400 people began a study, this would yield final samples that differed by only about three persons. This may be a price worth paying if it lets Turkers (whom Amazon considers legally equivalent to self-employed people or contractors; Gleibs, 2017) judge whether a task is worth completing. However, because built-in progress bars can be inaccurate, researchers should consider including additional or alternative markers of progress (e.g., explicit statements along the lines of “Part 2 of 4,” “You are about two-thirds of the way through the survey,” or “There is one more set of questions before we collect your demographic details”). Some participants explicitly recommended combining these sorts of comments with efforts to be encouraging, stating, for example, “I love a little encouragement—as in ‘You're doing great—only a few more questions!’ It makes me feel like people really care about what I'm doing.”

Survey length and repetition

Long surveys can discourage participation, cause poorer completion rates among people who do participate, and elicit hastier, shorter, and more “uniform” responses to items placed later in the survey (Galesic & Bosnjak, 2009; Liu & Wronski, 2018; Marcus et al., 2007). In a recent study, the median ideal and maximum survey lengths reported by participants were 10 and 20 minutes, respectively (Revilla & Ochoa, 2017). Few participants in our study specified an upper limit on study length, but a significant number felt surveys should be shorter. Piloting a study (using the same pool of workers who will participate in the main study) to ensure the survey length falls within reasonable bounds (relative to the compensation offered) is, therefore, advisable.

Remarks like “There are too many synonyms used in the questions sometimes…I feel I am answering the same question over and over again” represented more than 9% of frustrations. Annoyance at the repetitiveness of questions was, therefore, far more prevalent than annoyance at the mere length of a survey. This suggests that it is not just how much of a participant’s time a survey takes that matters, but how well that participant’s time is used.

Researchers may take for granted that multi-item (MI) measures of constructs (which contribute to the repetitiveness that frustrates participants) are inherently superior to single-item (SI) measures. They may also believe using MI scales to be a prerequisite for placing articles in top journals for which “measurement reliability of the coefficient alpha kind…[is] sacrosanct” (Drolet & Morrison, 2001, p. 196). However, although MI measures are often psychometrically superior to their SI counterparts (Diamantopoulos et al., 2012; Sarstedt & Wilczynski, 2009), they do have drawbacks that can be offset by using SI scales. Drolet and Morrison (2001) argue that MI scales add “little information over a one- or, at most, two-item scale” (p. 198), because as items are added participants are more apt to ignore differences between items.
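To make the statistical intuition behind this argument concrete, the short simulation below (our own illustration; the respondents are simulated and the 0.8 loading is an arbitrary assumption, not a value taken from any study cited here) shows how adding parallel items drives coefficient alpha upward even though the composite score's correlation with the construct it is meant to measure improves only marginally.

```python
# Illustrative simulation (ours, not an analysis from any cited study): adding
# parallel items inflates coefficient alpha faster than it improves how well the
# composite score tracks the construct the items are meant to measure.
import numpy as np

rng = np.random.default_rng(42)
n = 1000                      # simulated respondents
trait = rng.normal(size=n)    # latent construct every item is meant to tap

def make_items(k: int, loading: float = 0.8) -> np.ndarray:
    """Simulate k items that each reflect `trait` with the same loading."""
    noise = rng.normal(size=(n, k))
    return loading * trait[:, None] + np.sqrt(1 - loading**2) * noise

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

for k in (2, 5, 10):
    items = make_items(k)
    score = items.mean(axis=1)                  # the scale score a researcher would use
    validity = np.corrcoef(score, trait)[0, 1]  # how well the score tracks the construct
    print(f"k = {k:2d}   alpha = {cronbach_alpha(items):.2f}   corr(score, trait) = {validity:.2f}")
```

With these assumptions, alpha climbs from roughly .78 with two items to roughly .95 with ten, while the score's correlation with the simulated construct moves only from about .88 to about .97: a large gain in the reliability coefficient, but a modest gain in information about the construct.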
Moreover, the “repetitive and onerous” (Robinson, 2018, p. 742) nature of MI scales can prompt straight-lining (an undesirable form of responding to which SI instruments are less vulnerable) and reduce response rates. SI scales lessen participant fatigue, resulting in higher response rates that may compensate for psychometric shortcomings (Sarstedt & Wilczynski, 2009).

Participants in our study were willing to tolerate some repetition, but disliked being asked “the same question several times with one word difference each time where the words are synonyms.” Such phrasing hints at a significant problem. MI measures permit the easy calculation of reliability coefficients, and for the purpose of satisfying manuscript reviewers, the higher those coefficients are, the better. When researchers characterize an instrument as having a high level of reliability, however, it would often be more accurate to say that they are acknowledging “a high level of item redundancy wherein essentially the same item is repeated in several different ways” (Boyle, 1991, p. 291). Countering the “more is better” mentality ingrained in many researchers, inter-item correlations or alpha reliabilities that are too high may be evidence that a scale is “too narrow and too specific” (Boyle, 1991, p. 291) and fails to capture the full scope of the construct of interest. Many researchers were taught that a high reliability coefficient is necessary but not sufficient for scale validity. Far fewer, we suspect (ourselves included), learned that a high reliability coefficient may be a sign that a measure lacks validity.

These comments are intended neither as a screed against all MI instruments nor as a blanket endorsement of SI scales. Rather, we hope they will encourage researchers to reflect on whether MI or SI scales are more appropriate for their purposes (Diamantopoulos et al., 2012; Rossiter, 2002). However, if a good SI measure is available, we recommend researchers use it and reviewers support this choice. For example, when Robins et al. (2001) compared their single-item self-esteem measure with the Rosenberg Self-Esteem scale, they found that “disattenuated correlations were near unity” (i.e., once corrected for each measure’s unreliability, the correlations approached 1.0) and the scales shared “almost all of their reliable variance” (p. 426). In such cases, we see little reason to ask participants ten questions instead of one.

Limitations and future directions

Our study has several limitations. First, it relies on self-report data. Although we believe participants answered our questions in good faith, their remarks should be taken in conjunction with empirical findings. For example, although participants indicated that they would put forth more effort or be more attentive if they were better compensated, findings on this matter have been mixed. This does not, however, negate the ethical imperative to offer fair levels of pay.
Second, although we asked participants to provide open-ended accounts of frustrations with Turking, this was incidental to the primary purpose of the survey in which they participated (which was fielded to collect demographic data that would allow the first author to invite participants meeting eligibility criteria to participate in a separate study). Further work investigating the experiences of crowdsource workers would, therefore, be valuable. Arguably, for example, the power differential in the status accorded requesters and workers, and the level of remuneration offered to Turkers (and other crowdsource workers), suggests that there is little prestige attached to Turking; Gleibs (2017) considers Turking to be something that yields “a low-paid service income” (p. 1336). As such, research informed by scholarship on “dirty work” (Ashforth & Kreiner, 1999) might illuminate how Turkers discursively construct their identities.

Third, it is unclear how well our findings generalize from Turkers to crowdsource workers on other platforms that differ in important ways. For example, Prolific has a UK-centric participant pool, an emphasis on academic research, and a particular ethos regarding the treatment of participants. Researchers might, therefore, study participants from other crowdsourcing populations to compare positive and negative user experiences across platforms.

Finally, we recommend that researchers analyze which measures most often appear in crowdsourced studies. Just as meta-scientists have reported on the proportion of studies in their field that are conducted via crowdsourcing, we encourage micro-level research that assesses, for instance, the proportion of studies that ask participants to complete the Positive and Negative Affect Schedule (PANAS) or Rosenberg Self-Esteem inventory. This work would expand meaningfully on Marder and Fritz’s (2015) journalistic inquiry regarding the questions most often encountered by “Super Turkers.”

Conclusions

Crowdsourcing offers researchers in the social and behavioral sciences the opportunity to move beyond the college student samples that have been the mainstay of so much of our work. It also presents a new set of challenges for researchers to navigate. The Turkers we recruited were quite forthcoming about their experiences, and often seemed to appreciate having the chance to offer feedback on issues that researchers (including ourselves) may not have appreciated were so important to them. We hope that the present study may be helpful for colleagues as they seek to implement respondent-centered research practices when using crowdsourcing platforms such as MTurk. We close by reiterating Gleibs’ (2017) recommendation:

First and foremost we should understand MTurk workers (or other members of crowdsourcing platforms) not as “subjects” or anonymous workers who provide us with easily accessible data, but as active participants who make important contributions to our work and research in general. (p. 1338)

Funding Open Access funding enabled and organized by CAUL and its Member Institutions. Funding for this study was provided by a research fund grant to the first author from Massey Business School.

Declarations

Conflicts of interest The authors declared no conflicts of interest.

Ethics approval The study was submitted as low risk and approved as such by the first author’s university (Ethics Notification Number: 4000019423).
Consent to participate The following statements were provided to participants at the outset of the study. Continuing with the survey after reading these statements was interpreted as the tacit provision of informed consent.

I’ve already used MTurk a few times to collect data but want to learn a little more about what it's like to be a Turker so I can improve the design of my surveys. I’ve split the questions into three sections—I’m expecting each section will take about a minute to work through. I'll ask a few closed-ended questions and 2 open-ended ones (you only need to write a sentence for each unless there's something you want to get off your chest!). In the first section I’ll ask a few questions about your use of MTurk. In the second section, I’ll ask for your demographic information. In the third section, I ask you to respond to a few short items that I hope will help me think about how best to design a study I hope to do in the future. If you’d like to contact me about any aspect of this project, you can email me at _________, or call me at ___________, extension _______. For the first time, I'm running this study through the TurkPrime service, so if this causes you trouble, please let me know. If you have any concerns about the conduct of this research, please contact either myself or the Director of Research Ethics at my institution by emailing _________________.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Ahler, D. A., Roush, C. E., & Sood, G. (2021). The micro-task market for lemons: Data quality on Amazon’s Mechanical Turk. Political Science Research and Methods. https://doi.org/10.1017/psrm.2021.57

Ashforth, B. E., & Kreiner, G. E. (1999). “How can you do it?” Dirty work and the challenge of constructing a positive identity. Academy of Management Review, 24(3), 413–434. https://doi.org/10.2307/259134

Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2012). Evaluating online labor markets for experimental research: Amazon.com’s Mechanical Turk. Political Analysis, 20(3), 351–368. https://doi.org/10.1093/pan/mpr057

Biffignandi, S., & Bethlehem, J. (2021). Handbook of web surveys (2nd ed.). Wiley.

Boyle, G. J. (1991). Does item homogeneity indicate internal consistency or item redundancy in psychometric scales? Personality and Individual Differences, 12(3), 291–294. https://doi.org/10.1016/0191-8869(91)90115-R

Brawley, A. M., & Pury, C. L. S. (2016). Work experiences on MTurk: Job satisfaction, turnover, and information sharing. Computers in Human Behavior, 54, 531–546. https://doi.org/10.1016/j.chb.2015.08.031

Brown, N. (2015). Mechanical Turk: Amazon’s new charges are not the biggest problem.
Retrieved July 24, 2017, from http://steamtraen.blogspot.co.nz/2015/06/mechanical-turk-amazons-new-charges-are.html?m=1

Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpensive, yet high quality, data? Perspectives on Psychological Science, 6(1), 3–5. https://doi.org/10.1177/1745691610393980

Buhrmester, M. D., Talaifar, S., & Gosling, S. D. (2018). An evaluation of Amazon’s Mechanical Turk, its rapid rise, and its effective use. Perspectives on Psychological Science, 13(2), 149–154. https://doi.org/10.1177/1745691617706516

Burnette, C. B., Luzier, J. L., Bennett, B. L., Weisenmuller, C. M., Kerr, P., Martin, S., Keener, J., & Calderwod, L. (2022). Concerns and recommendations for using Amazon MTurk for eating disorder research. International Journal of Eating Disorders, 55(2), 263–272. https://doi.org/10.1002/eat.23614

Casey, T. W., & Poropat, A. (2014). Beauty is more than screen deep: Improving the web survey respondent experience through socially-present and aesthetically-pleasing user interfaces. Computers in Human Behavior, 30, 153–163. https://doi.org/10.1016/j.chb.2013.08.001

Casler, K., Bickel, L., & Hackett, E. (2013). Separate but equal? A comparison of participants and data gathered via Amazon’s MTurk, social media, and face-to-face behavioral testing. Computers in Human Behavior, 29(6), 2156–2160. https://doi.org/10.1016/j.chb.2013.05.009

Chandler, J. J., & Paolacci, G. (2017). Lie for a dime: When most prescreening responses are honest but most study participants are imposters. Social Psychological and Personality Science, 8(5), 500–508. https://doi.org/10.1177/1948550617698203

Chandler, J., & Shapiro, D. (2016). Conducting clinical research using crowdsourced convenience samples. Annual Review of Clinical Psychology, 12(1), 53–81. https://doi.org/10.1146/annurev-clinpsy-021815-093623

Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaivete among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46, 112–130. https://doi.org/10.3758/s13428-013-0365-7

Chmielewski, M., & Kucker, S. C. (2019). An MTurk crisis? Shifts in data quality and the impact on study results. Social Psychological and Personality Science, 11(4), 464–473. https://doi.org/10.1177/1948550619875149

Christenson, D. P., & Glick, D. M. (2013). Crowdsourcing panel studies and real-time experiments in MTurk. The Political Methodologist, 20(2), 27–32.
Crawford, S. D., Couper, M. P., & Lamias, M. J. (2001). Web surveys: Perceptions of burden. Social Science Computer Review, 19(2), 146–162. https://doi.org/10.1177/089443930101900202

Diamantopoulos, A., Sarstedt, M., Fuchs, C., Wilczynski, P., & Kaiser, S. (2012). Guidelines for choosing between multi-item and single-item scales for construct measurement: A predictive validity perspective. Journal of the Academy of Marketing Science, 40(3), 434–449. https://doi.org/10.1007/s11747-011-0300-3

Dillman, D. A., Smyth, J. D., & Christian, L. M. (2014). Internet, phone, mail, and mixed-mode surveys (4th ed.). John Wiley & Sons.

Downs, J. S., Holbrook, M. B., Sheng, S., & Cranor, L. F. (2010). Are your participants gaming the system? Screening Mechanical Turk workers. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, USA, 4, 2399–2402. https://doi.org/10.1145/1753326.1753688

Drolet, A. L., & Morrison, D. G. (2001). Do we really need multiple-item measures in service research? Journal of Service Research, 3(3), 196–204. https://doi.org/10.1177/109467050133001

Galesic, M., & Bosnjak, M. (2009). Effects of questionnaire length on participation and indications of response quality in a web survey. Public Opinion Quarterly, 73(2), 349–360. https://doi.org/10.1093/poq/nfp031

Gleibs, I. H. (2017). Are all “research fields” equal? Rethinking practice for the use of data from crowdsourcing market places. Behavior Research Methods, 49, 1333–1342. https://doi.org/10.3758/s13428-016-0789-y

Goodman, J. K., & Paolacci, G. (2017). Crowdsourcing consumer research. Journal of Consumer Research, 44(1), 196–210. https://doi.org/10.1093/jcr/ucx047

Goodman, J. K., Cryder, C. E., & Cheema, A. (2012). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26(3), 213–224. https://doi.org/10.1002/bdm.1753

Hara, K., Adams, A., Milland, K., Savage, S., Callison-Burch, C., & Bigham, J. P. (2017). A data-driven analysis of workers’ earnings on Amazon Mechanical Turk.
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Paper No. 449. https://doi.org/10.1145/3173574.3174023

Harms, P. D., & DeSimone, J. A. (2015). Caution! MTurk workers ahead–Fines doubled. Industrial and Organizational Psychology, 8(2), 183–190. https://doi.org/10.1017/iop.2015.23

Hauser, D. J., & Schwarz, N. (2015). It’s a trap! Instructional manipulation checks prompt systematic thinking on “tricky” tasks. Sage Open, 5(2). https://doi.org/10.1177/2158244015584617

Hauser, D. J., & Schwarz, N. (2016). Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 48, 400–407. https://doi.org/10.3758/s13428-015-0578-z

Heerwegh, D., & Loosveldt, G. (2006). An experimental study on the effects of personalization, survey length statements, progress indicators, and survey sponsor logos in web surveys. Journal of Official Statistics, 22(2), 191–210.

Hitlin, P. (2016). Research in the crowdsourcing age, a case study. Pew Research Center, July 2016. Retrieved February 5, 2020, from https://www.pewresearch.org/internet/2016/07/11/research-in-the-crowdsourcing-age-a-case-study/

Kees, J., Berry, C., Burton, S., & Sheehan, K. (2017). An analysis of data quality: Professional panels, student subject pools, and Amazon’s Mechanical Turk. Journal of Advertising, 46(1), 141–155. https://doi.org/10.1080/00913367.2016.1269304

Krippendorff, K. (2012). Content analysis: An introduction to its methodology. Sage.

Kung, F. Y. H., Kwok, N., & Brown, D. J. (2018). Are attention check questions a threat to scale validity? Applied Psychology: An International Review, 67(2), 264–283. https://doi.org/10.1111/apps.12108

Lindlof, T. R., & Taylor, B. C. (2011). Qualitative communication research methods (3rd ed.). Sage.

Litman, L., Robinson, J., & Rosenzweig, C. (2015). The relationship between motivation, monetary compensation, and data quality among US- and India-based workers on Mechanical Turk. Behavior Research Methods, 47(2), 519–528. https://doi.org/10.3758/s13428-014-0483-x

Liu, M., & Cernat, A. (2018). Item-by-item versus matrix questions: A web survey experiment. Social Science Computer Review, 36(6), 690–706. https://doi.org/10.1177/0894439316674459

Liu, M., & Wronski, L. (2018). Examining completion rates in web surveys via over 25,000 real-world surveys. Social Science Computer Review, 36(1), 116–124. https://doi.org/10.1177/0894439317695581

Mahon-Haft, T. A., & Dillman, D. A. (2010). Does visual appeal matter? Effects of web survey aesthetics on survey quality. Survey Research Methods, 4(1), 43–59. https://doi.org/10.18148/srm/2010.v4i1.2264

Marcus, B., Bosnjak, M., Lindner, S., Pilischenko, S., & Schütz, A. (2007). Compensating for low topic interest and long surveys: A field experiment on nonresponse in web surveys. Social Science Computer Review, 25(3), 372–383. https://doi.org/10.1177/0894439307297606

Marder, J., & Fritz, M. (2015). The internet’s hidden science factory [Blog post]. Retrieved July 24, 2017, from https://www.pbs.org/newshour/science/inside-amazons-hidden-science-factory

Mehrotra, D. (2020). Horror stories from inside Amazon’s Mechanical Turk. Retrieved February 5, 2020, from https://www.gizmodo.com.au/2020/01/horror-stories-from-inside-amazons-mechanical-turk/

Moss, A. J., Rosenzweig, C., Robinson, J., Jaffe, S. N., & Litman, L. (2020).
Is it ethical to use Mechanical Turk for behavioral research? Relevant data from a representative survey of MTurk participants and wages. https://doi.org/10.31234/osf.io/jbc9d

Müller, H., Sedley, A., & Ferrall-Nunge, E. (2014). Survey Research in HCI. In J. Olson & W. Kellogg (Eds.), Ways of Knowing in HCI (pp. 229–266). Springer. https://doi.org/10.1007/978-1-4939-0378-8_10

Necka, E., Cacioppo, S., Norman, G. J., & Cacioppo, J. T. (2016). Measuring the prevalence of problematic respondent behaviors among MTurk, campus, and community participants. PLoS ONE, 11, e0157732. https://doi.org/10.1371/journal.pone.0157732

Paolacci, G., Chandler, J., & Ipeirotis, P. G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5, 411–419. Retrieved May 8, 2015, from https://ssrn.com/abstract=1626226

Peer, E., Vosgerau, J., & Acquisti, A. (2014). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods, 46, 1023–1031. https://doi.org/10.3758/s13428-013-0434-y

Peer, E., Brandimarte, L., Samat, S., & Acquisti, A. (2017). Beyond the Turk: Alternative platforms for crowdsourcing behavioral research. Journal of Experimental Social Psychology, 70, 153–163. https://doi.org/10.1016/j.jesp.2017.01.006

Pittman, M., & Sheehan, K. (2016). Amazon’s Mechanical Turk a digital sweatshop? Transparency and accountability in crowdsourced online research. Journal of Media Ethics, 31(4), 260–262. https://doi.org/10.1080/23736992.2016.1228811

Revilla, M., & Ochoa, C. (2017). Ideal and maximum length for a web survey. International Journal of Market Research, 59(5), 557–565. https://doi.org/10.2501/IJMR-2017-039

Robertson, A. Z., & Yoon, A. H. (2019). You get what you pay for: An empirical examination of the use of MTurk in legal scholarship.
Vanderbilt Law Review, 72(5), 1633–1674. Retrieved June 7, 2022, from https://scholarship.law.vanderbilt.edu/vlr/vol72/iss5/4

Robins, R. W., Hendin, H. M., & Trzesniewski, K. H. (2001). Measuring global self-esteem: Construct validation of a single-item measure and the Rosenberg self-esteem scale. Personality and Social Psychology Bulletin, 27(2), 151–161. https://doi.org/10.1177/0146167201272002

Robinson, M. A. (2018). Using multi-item psychometric scales for research and practice in human resource management. Human Resource Management, 57(3), 739–750. https://doi.org/10.1002/hrm.21852

Roman, Z. J., Brandt, H., & Miller, J. M. (2022). Automated bot detection using Bayesian latent class models in online surveys. Frontiers in Psychology, 13. https://doi.org/10.3389/fpsyg.2022.789223

Rossiter, J. R. (2002). The C-OAR-SE procedure for scale development in marketing. International Journal of Research in Marketing, 19(4), 305–335. https://doi.org/10.1016/S0167-8116(02)00097-6

Sannon, S., & Cosley, D. (2018). “It was a shady HIT”: Navigating work-related privacy concerns on MTurk.
CHI EA '18: Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3170427.3188511

Sarstedt, M., & Wilczynski, P. (2009). More for less? A comparison of single-item and multi-item measures. Die Betriebswirtschaft, 69(2), 211–227.

Schmidt, G. B. (2015). Fifty days an MTurk worker: The social and motivational context for Amazon Mechanical Turk workers. Industrial and Organizational Psychology, 8(2), 165–237. https://doi.org/10.1017/iop.2015.20

Semuels, A. (2018). The internet is enabling a new kind of poorly paid hell. Retrieved January 29, 2020, from https://www.theatlantic.com/business/archive/2018/01/amazon-mechanical-turk/551192/

Shapiro, D. N., Chandler, J., & Mueller, P. A. (2013). Using Mechanical Turk to study clinical populations. Clinical Psychological Science, 1(2), 213–220. https://doi.org/10.1177/2167702612469015

Sheehan, K. B. (2018). Crowdsourcing research: Data collection with Amazon’s Mechanical Turk. Communication Monographs, 85(1), 140–156. https://doi.org/10.1080/03637751.2017.1342043

Siegel, J. T., & Navarro, M. (2019). A conceptual replication examining the risk of overtly listing eligibility criteria on Amazon’s Mechanical Turk. Journal of Applied Social Psychology, 49(4), 239–248. https://doi.org/10.1111/jasp.12580

Stansberry, K. (2020). Measurement in Public Relations. In E. E. Graham & J. P. Mazer (Eds.), Communication Research Measures III: A Sourcebook (pp. 108–119). Routledge.

Stewart, N., Ungemach, C., Harris, A. J. L., Bartels, D. M., Newell, B. R., Paolacci, G., & Chandler, J. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10, 479–491.

Stewart, N., Chandler, J., & Paolacci, G. (2017). Crowdsourcing samples in cognitive science. Trends in Cognitive Sciences, 21(10), 736–748. https://doi.org/10.1016/j.tics.2017.06.007

Suri, S., Goldstein, D. G., & Mason, W. A. (2011). Honesty in an online labor market. Proceedings of the 11th AAAI Conference on Human Computation, pp. 61–66.

Toepoel, V., Das, M., & Van Soest, A. (2009). Design of web questionnaires: The effects of the number of items per screen. Field Methods, 21(2), 200–213. https://doi.org/10.1177/1525822X08330261

Wessling, K. S., Huber, J., & Netzer, O. (2017). MTurk character misrepresentation: Assessment and solutions. Journal of Consumer Research, 44, 211–230. https://doi.org/10.1093/jcr/ucx053

Zhou, H., & Fishbach, A. (2016). The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. Journal of Personality and Social Psychology, 111(4), 493–504. https://doi.org/10.1037/pspa0000056

Zuell, C., Menold, N., & Körber, S. (2015). The influence of the answer box size on item nonresponse to open-ended questions in a web survey. Social Science Computer Review, 33(1), 115–122. https://doi.org/10.1177/0894439314528091

Open practices statement The data used in this study are available at https://osf.io/nmr6h/files/. The study was not preregistered.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.