How many crowdsourced workers should a requester hire?

Ann Math Artif Intell (2016) 78:45–72
DOI 10.1007/s10472-015-9492-4

Arthur Carvalho¹ · Stanko Dimitrov² · Kate Larson³

Published online: 6 January 2016
© The Author(s) 2016. This article is published with open access at Springerlink.com

¹ Rotterdam School of Management, Erasmus University, Burgemeester Oudlaan 50, 3062 PA Rotterdam, The Netherlands
² Department of Management Sciences, University of Waterloo, 200 University Ave W., Waterloo, ON N2L 3G1, Canada
³ David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W., Waterloo, ON N2L 3G1, Canada

Abstract Recent years have seen an increased interest in crowdsourcing as a way of obtaining information from a potentially large group of workers at a reduced cost. The crowdsourcing process, as we consider it in this paper, is as follows: a requester hires a number of workers to work on a set of similar tasks. After completing the tasks, each worker reports back outputs. The requester then aggregates the reported outputs to obtain aggregate outputs. A crucial question that arises during this process is: how many crowd workers should a requester hire? In this paper, we investigate from an empirical perspective the optimal number of workers a requester should hire when crowdsourcing tasks, with a particular focus on the crowdsourcing platform Amazon Mechanical Turk. Specifically, we report the results of three studies involving different tasks and payment schemes. We find that both the expected error in the aggregate outputs and the risk of a poor combination of workers decrease as the number of workers increases. Surprisingly, we find that the optimal number of workers a requester should hire for each task is around 10 to 11, no matter the underlying task and payment scheme. To derive such a result, we employ a principled analysis based on bootstrapping and segmented linear regression. Besides the above result, we also find that overall top-performing workers are more consistent across multiple tasks than other workers. Our results thus contribute to a better understanding of, and provide new insights into, how to design more effective crowdsourcing processes.

Keywords Crowdsourcing · Human computation · Amazon Mechanical Turk

Mathematics Subject Classification (2010) 68T99 · 90B99

1 Introduction

Recent technological advances have facilitated the outsourcing of a variety of tasks to “the crowd”, e.g., decision support for various phases of managerial decision-making and problem solving [15], the design of advertisements [33], the development and testing of large software applications, the design of websites, professional translation of documents, transcription of audio, etc. Such a practice of obtaining relevant information or services from a large group of people, or outsourcing tasks to the crowd, is traditionally referred to as crowdsourcing.

There are many different ways of outsourcing a task to the crowd. The crowdsourcing process we consider in this paper is as follows: a requester hires a number of crowd workers to work on a set of similar tasks. The term requester denotes an agent who wants to get the task solved, e.g., an institution, a researcher, etc. The underlying tasks are homogeneous in the sense that they are instances of the same class of tasks, e.g., content-analysis tasks, prediction tasks, and so on. Workers then work on the same set of tasks, but without formally communicating with each other. This is done to preserve the diversity of opinions throughout the process. After completing the tasks, each worker reports an output per task back to the requester. Outputs are context-dependent.
For example, for prediction tasks, each output can be either a point estimate or a probability distribution over the plausible outcomes, whereas in sentiment-analysis tasks, the output is usually a score inside a discrete set representing how positive/negative the sentiment behind the underlying text is. After obtaining workers’ outputs, the requester then aggregates the reported outputs to obtain an aggregate output per task. We focus on averages when aggregating workers’ outputs, a simple yet robust technique [14, 16]. Ideally, aggregate outputs are, in expectation, more accurate than any individual output. This is the basic premise behind so-called collective intelligence.

A crucial question that arises during the above crowdsourcing process is: how many crowd workers should a requester hire? Or, less specifically, how does the number of workers influence the quality of the aggregate output? We first note that arguments can be made both for and against the use of multiple workers. On the one hand, hiring multiple workers might bring diversity to the crowdsourcing process so that biases of individual judgments can offset each other, which might result in a more accurate aggregate output. On the other hand, a larger population of crowd workers might bring down the quality of aggregate outputs due to the likely inclusion of poor-quality workers.

In this paper, we empirically investigate the above questions through a series of studies using a popular crowdsourcing platform: Amazon Mechanical Turk. Our studies differ from each other in terms of the underlying tasks and/or payment schemes. In our first study, we ask workers to solve three content-analysis tasks, and we pay workers per completed task. In our second study, we also ask workers to solve three content-analysis tasks, but their payments are based on the similarity of their reported outputs.
In our third study, we ask workers to solve two prediction tasks, and we pay the workers using a proper scoring rule [42]. Due to the nature of the tasks in our studies, we are able to derive gold-standard outputs for each task, i.e., either ground-truth outputs or outputs of high quality provided by experts with relevant expertise. The existence of gold-standard outputs allows us to investigate how different combinations of workers affect the accuracy of aggregate outputs.

In our first analysis, we find a substantial degree of improvement in expected accuracy as we increase the number of hired workers, with diminishing returns for extra workers. Moreover, the standard deviation of errors in the aggregate outputs decreases with more workers, which implies less risk when aggregating workers’ outputs.

Our next contribution is a principled method for determining the optimal number of workers a requester should hire. Specifically, the proposed method combines bootstrapping with segmented linear regression analysis to determine the point at which hiring an extra worker has a negligible impact on the expected accuracy of the aggregate output. Surprisingly, we find in our studies that the optimal number of workers a requester should hire (...truncated)
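
As an illustration of the kind of analysis outlined above, the following Python sketch estimates the expected error of averaged outputs for each crowd size by bootstrapping, and then locates a single breakpoint with a two-segment linear regression. It is a minimal sketch under stated assumptions, not the procedure used in the studies: the function names, the mean absolute error measure, resampling with replacement, the continuous hinge model, and the grid search over breakpoints are all illustrative choices, and the data in the demo are synthetic.

```python
import numpy as np


def bootstrap_error_curve(outputs, gold, max_workers, n_boot=2000, seed=0):
    """Mean and standard deviation of the aggregate error per crowd size.

    outputs: array of shape (n_workers, n_tasks), one reported output per task.
    gold:    array of shape (n_tasks,), the gold-standard outputs.
    Assumption: workers are resampled with replacement, and the error of a
    combination is the mean absolute error of the averaged outputs.
    """
    rng = np.random.default_rng(seed)
    n_workers = outputs.shape[0]
    means, stds = [], []
    for k in range(1, max_workers + 1):
        # Draw n_boot hypothetical crowds of k workers each.
        idx = rng.integers(0, n_workers, size=(n_boot, k))
        aggregate = outputs[idx].mean(axis=1)          # (n_boot, n_tasks)
        errors = np.abs(aggregate - gold).mean(axis=1)  # (n_boot,)
        means.append(errors.mean())
        stds.append(errors.std())
    return np.array(means), np.array(stds)


def fit_breakpoint(x, y):
    """Fit y = a + b*x + c*max(0, x - bp) for each candidate bp by least
    squares and return the bp with the smallest squared error.

    Assumption: exactly one breakpoint, chosen among the interior x values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for bp in x[1:-1]:
        design = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - bp)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(bp), coef)
    return best[1], best[2]


if __name__ == "__main__":
    # Synthetic data only: 50 hypothetical workers, 5 tasks, unit noise.
    rng = np.random.default_rng(1)
    gold = rng.uniform(0.0, 10.0, size=5)
    outputs = gold + rng.normal(0.0, 1.0, size=(50, 5))

    mean_err, std_err = bootstrap_error_curve(outputs, gold, max_workers=20)
    sizes = np.arange(1, 21, dtype=float)
    breakpoint_size, _ = fit_breakpoint(sizes, mean_err)
    print("estimated breakpoint (workers):", breakpoint_size)
```

On such a curve, the fitted breakpoint marks the crowd size beyond which the slope of the expected error changes, which is one way to read "negligible impact of an extra worker". The value obtained on synthetic data says nothing about the 10 to 11 workers reported in the abstract.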