Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data

https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0071226&type=printable

Márton Mestyán, Taha Yasseri, János Kertész
Affiliations: 1 Institute of Physics, Budapest University of Technology and Economics, Budapest, Hungary; 2 Oxford Internet Institute, University of Oxford, Oxford, United Kingdom; 3 Department of Biomedical Engineering and Computational Science, Aalto University, Aalto, Finland; 4 Center for Network Science, Central European University, Budapest, Hungary
Editor: Attila Szolnoki, Hungarian Academy of Sciences, Hungary
Citation: Mestyán M, Yasseri T, Kertész J (...)
Abstract: Use of socially generated "big data" to access information about collective states of the minds in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be the prediction of the society's reaction to a new product in the sense of popularity and adoption rate. However, bridging the gap between "real time monitoring" and "early predicting" remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the corresponding entry to the movie in Wikipedia, the well-known online encyclopedia.
Funding: Partial financial support from the EU's 7th Framework Program's FET-Open to ICTeCollective project no. 238597 and by the Academy of Finland, the Finnish Center of Excellence program, project no. 129670, and TEKES (FiDiPro) is gratefully acknowledged. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Living in today's digital world has, along with all its advantages, side effects and byproducts. Our daily life now leaves a digital trace of all our activities in the recently developed environments based on Information and Communications Technology. Our social communications through different digital channels, financial activities within e-commerce, physical locations registered by cell phone providers, etc., are traced and recorded. In addition to such passive collection of data about online activity, we also actively share information about our feelings, emotional moods, opinions, and views through the so-called Web 2.0, or user-generated content, within social media. Besides providing novel answers to classic questions about individual and social aspects of human life from a scientific point of view, precise analysis of this huge amount of data can have practical applications in predicting, monitoring, and coping with many different types of events, from simple matters of daily life to massive crises on a global scale. For example, Sakaki et al. have developed an alerting system based on Tweets (posts in the Twitter microblogging service) that is able to detect earthquakes almost in real time [1]. They extended their detection system further to detect rainbows in the sky and traffic jams in cities [2]. The practical point of their work is that the alerting system can respond so promptly that the alert message can reach certain regions before the earthquake waves do. Bollen et al. analyzed the moods of Tweets and, based on their investigations, could predict daily up and down changes in Dow Jones Industrial Average values with an accuracy of 87.6% [3]. Saavedra et al. investigated the relationship between the content of traders' messages and market dynamics. They showed that there is a positive correlation between the usage of bundles of positive and negative words and agents' overall financial performance [4].
Another example is the use of Twitter to predict electoral outcomes [5], albeit with biases and limitations [6,7]. Interesting studies have appeared on the use of social media indicators to predict the scientific impact of research articles, e.g., short-term web usage (number of downloads from the pre-print sharing web site arXiv) [8] and Twitter mentions [9]. A recent work shows that Twitter mentions and arXiv downloads follow two distinct temporal patterns of activity; however, the volume of Twitter mentions is statistically correlated with arXiv downloads and early citations [10]. Preis et al. found a correlation between weekly transaction volumes of S&P 500 companies and weekly Google search volumes of the corresponding company names [11]. By analyzing search queries for information about preceding and following years, a striking correlation was observed between a country's GDP and the predisposition of its inhabitants to look forward [12]. Based on Google search logs, Ginsberg et al. estimated the spread of influenza in the United States [13]. There are other examples of using social media streams to predict news popularity in terms of the number of user-generated comments [14,15] or the number of news visitors [16]. For a comprehensive literature review see [17].
Statistical analysis of motion picture markets has led to intriguing results, such as evidence for a Pareto law for movie income [18,19], along with a log-normal distribution of the gross income per theater and a bimodal distribution of the number of theaters in which a movie is shown [20]. By analyzing historical data on 70 years of the American movie market, Sreenivasan has argued that movies with a higher level of novelty (assigned based on keywords from the Internet Movie Database) produce larger revenue [21]. Despite much effort with different approaches, predicting the financial success of a movie remains a challenging open problem.
For example, Sharda and Delen have trained a neural network to process pre-release data, such as quality and popularity variables, and classify movies into nine categories according to their anticipated income, from flop to blockbuster. For test samples, the neural network classifies only 36.9% of the movies correctly, while 75.2% of the movies are at most one category away from the correct one [22]. Joshi et al. have built a multivariate linear regression model that combines meta-data with text features from pre-release critiques to predict revenue with a coefficient of determination R² = 0.671 [23]. Since predictions based on classic quality factors fail to reach a level of accuracy high enough for practical application, using user-generated data to predict the success of a movie becomes a very tempting approach. Ishii et al. present a mathematical framework for the spread of popularity in society [24]. Their model, which takes the advertisement budget as an input parameter and generates a dynamic popularity variable, is validated against the number of blog posts on the particular movies in the Japanese Blogosphere. I (...truncated)