How many crowdsourced workers should a requester hire?
Ann Math Artif Intell (2016) 78:45–72
DOI 10.1007/s10472-015-9492-4
Arthur Carvalho¹ · Stanko Dimitrov² · Kate Larson³
Published online: 6 January 2016
© The Author(s) 2016. This article is published with open access at Springerlink.com
Abstract Recent years have seen an increased interest in crowdsourcing as a way of
obtaining information from a potentially large group of workers at a reduced cost. The
crowdsourcing process, as we consider in this paper, is as follows: a requester hires a number of workers to work on a set of similar tasks. After completing the tasks, each worker
reports back outputs. The requester then aggregates the reported outputs to obtain aggregate outputs. A crucial question that arises during this process is: how many crowd workers
should a requester hire? In this paper, we investigate from an empirical perspective the optimal number of workers a requester should hire when crowdsourcing tasks, with a particular
focus on the crowdsourcing platform Amazon Mechanical Turk. Specifically, we report the
results of three studies involving different tasks and payment schemes. We find that both the
expected error in the aggregate outputs and the risk of a poor combination of workers
decrease as the number of workers increases. Surprisingly, we find that the optimal number
of workers a requester should hire for each task is around 10 to 11, no matter the underlying
task and payment scheme. To derive such a result, we employ a principled analysis based on
bootstrapping and segmented linear regression. Besides the above result, we also find that
overall top-performing workers are more consistent across multiple tasks than other workers. Our results thus contribute to a better understanding of, and provide new insights into,
how to design more effective crowdsourcing processes.
Keywords Crowdsourcing · Human computation · Amazon Mechanical Turk
Mathematics Subject Classification (2010) 68T99 · 90B99
¹ Rotterdam School of Management, Erasmus University, Burgemeester Oudlaan 50, 3062 PA Rotterdam, The Netherlands
² Department of Management Sciences, University of Waterloo, 200 University Ave W., Waterloo, ON N2L 3G1, Canada
³ David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W., Waterloo, ON N2L 3G1, Canada
1 Introduction
Recent technological advances have facilitated the outsourcing of a variety of tasks to “the
crowd”, e.g., decision support for the various phases of managerial decision-making
and problem solving [15], the design of advertisements [33], the development and testing of
large software applications, the design of websites, the professional translation of documents,
and the transcription of audio. Such a practice of obtaining relevant information or services
from a large group of people, or outsourcing tasks to the crowd, is traditionally referred to
as crowdsourcing.
There are many different ways of outsourcing a task to the crowd. The crowdsourcing
process we consider in this paper is as follows: a requester hires a number of crowd workers
to work on a set of similar tasks. The term requester denotes an agent who wants to get the
tasks solved, e.g., an institution or a researcher. The underlying tasks are homogeneous
in the sense that they are instances of the same class of tasks, e.g., content-analysis tasks or
prediction tasks. Workers then work on the same set of tasks, but without formally
communicating with each other. This is done to preserve the diversity of opinions throughout
the process.
After completing the tasks, each worker reports an output per task back to the requester.
Outputs are context-dependent. For example, for prediction tasks, each output can be
either a point estimate or a probability distribution over the plausible outcomes, whereas
in sentiment-analysis tasks, the output is usually a score inside a discrete set representing
how positive/negative the sentiment behind the underlying text is. After obtaining workers’
outputs, the requester then aggregates the reported outputs to obtain an aggregate output
per task. We focus on averages when aggregating workers’ outputs, a simple, yet robust
technique [14, 16]. Ideally, aggregate outputs are, in expectation, more accurate than any
individual output. This is the basic premise behind so-called collective intelligence.
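The averaging step described above can be sketched in a few lines of Python. This is only an illustration: the function name `aggregate_outputs`, the dictionary layout, and the scores are hypothetical and are not taken from the studies.

```python
from statistics import mean


def aggregate_outputs(reports):
    """Average the workers' reported outputs for each task.

    `reports` maps a task id to the list of numeric outputs reported by
    the workers for that task (a hypothetical layout for illustration).
    """
    return {task: mean(outputs) for task, outputs in reports.items()}


# Hypothetical scores reported by four workers on two tasks.
reports = {
    "task_a": [3.0, 4.0, 4.0, 5.0],
    "task_b": [1.0, 2.0, 2.0, 3.0],
}
aggregate = aggregate_outputs(reports)
```

Each task receives exactly one aggregate output, computed independently of the other tasks.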
A crucial question that arises during the above crowdsourcing process is: how many
crowd workers should a requester hire? Or, more generally, how does the number of workers influence the quality of the aggregate output? We first note that arguments can be made
both for and against the use of multiple workers. On the one hand, hiring multiple workers
might bring diversity to the crowdsourcing process so that biases of individual judgments
can offset each other, which might result in a more accurate aggregate output. On the other
hand, a larger population of crowd workers might bring down the quality of aggregate
outputs due to the likely inclusion of poor-quality workers.
In this paper, we empirically investigate the above questions through a series of studies
using a popular crowdsourcing platform: Amazon Mechanical Turk. Our studies differ from
each other in terms of the underlying tasks and/or payment schemes. In our first study, we
ask workers to solve three content-analysis tasks, and we pay workers per completed task.
In our second study, we also ask workers to solve three content-analysis tasks, but their
payments are based on the similarity of their reported outputs. In our third study, we ask
workers to solve two prediction tasks, and we pay the workers using a proper scoring rule
[42].
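To make the notion of a proper scoring rule concrete, the sketch below uses the quadratic scoring rule, a standard example of such a rule. Which rule and which payment scaling the third study actually uses is not stated in this passage, so the choice here is an assumption, and the function names and probabilities are hypothetical.

```python
def quadratic_score(forecast, outcome):
    """Quadratic scoring rule: 2 * p[outcome] minus the sum of squared probabilities."""
    return 2.0 * forecast[outcome] - sum(p * p for p in forecast)


def expected_score(report, belief):
    """Expected score of `report` when outcomes are drawn according to `belief`."""
    return sum(
        belief[x] * quadratic_score(report, x) for x in range(len(belief))
    )


# Hypothetical worker who believes the first outcome has probability 0.7.
belief = [0.7, 0.3]
truthful = expected_score(belief, belief)          # worker reports the true belief
exaggerated = expected_score([0.9, 0.1], belief)   # worker overstates confidence
```

Under a proper scoring rule, a worker maximizes the expected score by reporting the true belief, so `truthful` exceeds `exaggerated` in this example.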
Due to the nature of the tasks in our studies, we are able to derive gold-standard outputs for each task, i.e., either ground-truth outputs or outputs of high quality provided by
experts with relevant expertise. The existence of gold-standard outputs allows us to investigate how different combinations of workers affect the accuracy of aggregate outputs. In
our first analysis, we find a substantial degree of improvement in expected accuracy as we
increase the number of hired workers, with diminishing returns for extra workers. Moreover, the standard deviation of errors in the aggregate outputs decreases with more workers,
which implies less risk when aggregating workers’ outputs.
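One way to picture this kind of analysis is to enumerate every combination of workers of a given size, average each combination's outputs, and summarize the resulting errors against the gold standard. The sketch below is a simplified illustration under assumptions: absolute error is used as the error measure, the function name `error_by_group_size` is hypothetical, and the data are invented.

```python
from itertools import combinations
from statistics import mean, pstdev


def error_by_group_size(outputs, gold):
    """Summarise aggregate error for every possible number of workers.

    For each group size k, every k-worker combination is averaged and
    compared with the gold-standard output. Returns a dict mapping k to
    (mean absolute error, standard deviation of the absolute errors).
    """
    summary = {}
    for k in range(1, len(outputs) + 1):
        errors = [
            abs(mean(group) - gold) for group in combinations(outputs, k)
        ]
        summary[k] = (mean(errors), pstdev(errors))
    return summary


# Hypothetical outputs from four workers on one task with gold standard 4.0.
outputs = [2.0, 3.0, 5.0, 6.0]
gold = 4.0
summary = error_by_group_size(outputs, gold)
```

Exhaustive enumeration is only feasible for small worker pools, since the number of combinations grows quickly with the pool size. A toy data set this small also need not reproduce the patterns reported in the studies.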
Our next contribution is a principled method for determining the optimal number of
workers a requester should hire. Specifically, the proposed method combines bootstrapping
with segmented linear regression analysis to determine the point at which hiring an extra
worker has a negligible impact on the expected accuracy of the aggregate output. Surprisingly, we find in our studies that the optimal number of workers a requester should hire (...truncated)