Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data
Citation: Mestyán M, Yasseri T, Kertész J (
Márton Mestyán
Taha Yasseri
János Kertész
Attila Szolnoki, Hungarian Academy of Sciences, Hungary
1 Institute of Physics, Budapest University of Technology and Economics, Budapest, Hungary, 2 Oxford Internet Institute, University of Oxford, Oxford, United Kingdom, 3 Department of Biomedical Engineering and Computational Science, Aalto University, Aalto, Finland, 4 Center for Network Science, Central European University, Budapest, Hungary
The use of socially generated "big data" to access information about collective states of mind in human societies has become a new paradigm in the emerging field of computational social science. A natural application would be predicting society's reaction to a new product in terms of popularity and adoption rate. However, bridging the gap between "real-time monitoring" and "early prediction" remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the movie's entry in Wikipedia, the well-known online encyclopedia.
Funding: Partial financial support from the EU's 7th Framework Program's FET-Open to ICTeCollective project no. 238597 and by the Academy of Finland, the
Finnish Center of Excellence program, project no. 129670, and TEKES (FiDiPro) is gratefully acknowledged. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Living in today's digital world has, along with all its
advantages, side effects and byproducts. Our daily life
nowadays leaves a digital trace of all our activities in the recently
developed Information and Communications Technology based
environments. Our social communications through different
digital channels, financial activities within e-commerce, physical
locations registered by cell phone providers etc., are traced and
recorded. In addition to such passive collection of data about
online activity, we also actively share information about our
feelings, emotional moods, opinions and views through the so-called
Web 2.0, or user-generated content within social media. In
addition to providing us with novel answers to classic questions
about individual and social aspects of human life from a scientific
point of view, precise analysis of this huge amount of data can
have practical applications in predicting, monitoring, and coping with
many different types of events, from simple matters of daily life to
massive crises on a global scale. For example, Sakaki et al. have
developed an alerting system based on Tweets (posts on the Twitter
microblogging service) that is able to detect earthquakes almost in
real time [1]. They extended their detection system further to
detect rainbows in the sky and traffic jams in cities [2]. The
practical point of their work is that the alerting system can
perform so promptly that the alert message can reach certain
regions faster than the earthquake waves. Bollen et al. have
analyzed the moods of Tweets and, based on their investigations, they
could predict daily up and down changes in Dow Jones Industrial
Average values with an accuracy of 87.6% [3]. Saavedra et al.
investigated the relationship between the content of traders'
messages and market dynamics. They show that there is a positive
correlation between the usage of bundles of positive and
negative words and agents' overall financial performance [4].
Another example is using Twitter to predict electoral outcomes
[5], although this approach has its biases and limitations [6,7]. Interesting
studies have appeared on the use of social media indicators to
predict the scientific impact of research articles, e.g., short-term
web usage (number of downloads from the pre-print sharing web
site arXiv) [8] and Twitter mentions [9]. A recent work
shows that Twitter mentions and arXiv downloads follow two
distinct temporal patterns of activity; however, the volume of
Twitter mentions is statistically correlated with arXiv downloads
and early citations [10]. Preis et al. found a correlation between
weekly transaction volumes of S&P 500 companies and weekly
Google search volumes of corresponding company names [11]. An
analysis of search queries for information about preceding and
following years revealed a striking correlation between a country's GDP
and the predisposition of its inhabitants to look forward
[12]. Based on Google search logs, Ginsberg et al. estimated the
spread of influenza in the United States [13]. There are other
examples of using social media streams to predict
news popularity in terms of the number of user-generated
comments [14,15] or the number of news visitors [16]. For a
comprehensive literature review, see [17].
Statistical analysis of motion picture markets has led to
intriguing results, such as evidence for a Pareto
law for movie income [18,19] along with a log-normal distribution
of the gross income per theater and a bimodal distribution of the
number of theaters in which a movie is shown [20]. By analyzing
historical data covering 70 years of the American movie market,
Sreenivasan has argued that movies with a higher level of
novelty (assigned based on keywords from the Internet Movie
Database) produce larger revenues [21]. Despite much effort with
different approaches, predicting the financial success of a movie
remains a challenging open problem. For example, Sharda and
Delen have trained a neural network to process pre-release data,
such as quality and popularity variables, and classify movies into
nine categories according to their anticipated income, from "flop"
to "blockbuster". For test samples, the neural network classifies
only 36.9% of the movies correctly, while 75.2% of the movies are
at most one category away from the correct one [22]. Joshi et al. have built
a multivariate linear regression model that combines metadata with
text features from pre-release critiques to predict the revenue with
a coefficient of determination $R^2 = 0.671$ [23]. Since predictions
based on classic quality factors fail to reach a level of accuracy high
enough for practical application, the use of user-generated data to
predict the success of a movie becomes a very tempting approach.
Ishii et al. present a mathematical framework for the spread of
popularity in society [24]. Their model, which takes the
advertisement budget as an input parameter and generates a
dynamic popularity variable, is validated against the number of
blog posts on the particular movies in the Japanese Blogosphere.
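
The evaluation measures quoted above for the classifier of [22] (exact and "at most one category away" hit rates) and for the regression of [23] (coefficient of determination) can be illustrated with a minimal sketch. The function names and all numbers below are hypothetical and chosen only for illustration; they are not data or code from the cited works.

```python
def exact_and_one_away_accuracy(true_classes, predicted_classes):
    """Return (exact hit rate, within-one-class hit rate) for ordinal labels."""
    if len(true_classes) != len(predicted_classes) or not true_classes:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_classes)
    pairs = list(zip(true_classes, predicted_classes))
    # Exact: predicted class equals the true class.
    exact = sum(t == p for t, p in pairs) / n
    # One-away: predicted class differs from the true class by at most 1.
    one_away = sum(abs(t - p) <= 1 for t, p in pairs) / n
    return exact, one_away


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels on a nine-class ordinal scale
# (assumed here: 1 = "flop", 9 = "blockbuster").
true_classes = [1, 3, 5, 7, 9, 4, 2, 8]
predicted_classes = [1, 4, 5, 5, 8, 4, 3, 8]
exact, one_away = exact_and_one_away_accuracy(true_classes, predicted_classes)
print(f"exact: {exact:.3f}, within one class: {one_away:.3f}")
```

The one-away rate is never smaller than the exact rate, which is why a classifier can look weak on exact hits and still place most movies in a neighboring income category.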
I (...truncated)