The Anatomy of American Football: Evidence from 7 Years of NFL Game Data
RESEARCH ARTICLE
Konstantinos Pelechrinis1*, Evangelos Papalexakis2
1 School of Information Sciences, University of Pittsburgh, Pittsburgh, PA, United States of America,
2 Department of Computer Science and Engineering, University of California Riverside, Riverside, CA, United
States of America
Abstract
OPEN ACCESS
Citation: Pelechrinis K, Papalexakis E (2016) The
Anatomy of American Football: Evidence from 7
Years of NFL Game Data. PLoS ONE 11(12):
e0168716. doi:10.1371/journal.pone.0168716
Editor: Kimmo Eriksson, Mälardalen University,
SWEDEN
Received: July 23, 2016
Accepted: November 23, 2016
Published: December 22, 2016
Copyright: © 2016 Pelechrinis, Papalexakis. This is
an open access article distributed under the terms
of the Creative Commons Attribution License,
which permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: All relevant data are
available within the manuscript and deposited in
Github: https://github.com/kpelechrinis/
footballonomics.
Funding: The author(s) received no specific
funding for this work.
Competing Interests: The authors have declared
that no competing interests exist.
How much does a fumble affect the probability of winning an American football game? How
balanced should your offense be in order to increase the probability of winning by 10%?
These are questions for which the coaching staff of National Football League teams have a
clear qualitative answer. Turnovers are costly; turn the ball over several times and you will
certainly lose. Nevertheless, what does “several” mean? How “certain” is certainly? In this
study, we collected play-by-play data from the past 7 NFL seasons, i.e., 2009–2015, and we
built a descriptive model for the probability of winning a game. Although our
model incorporates simple box score statistics, such as total offensive yards and number of
turnovers, its overall cross-validation accuracy is 84%. Furthermore, we combine this
descriptive model with a statistical bootstrap module to build FPM (short for Football Prediction Matchup) for predicting future match-ups. The contribution of FPM lies in its simplicity and transparency, which do not come at the expense of the system’s performance. In
particular, our evaluations indicate that our prediction engine performs on par with the current state-of-the-art systems (e.g., ESPN’s FPI and Microsoft’s Cortana). The latter are typically proprietary, but based on their publicly described components they are significantly
more complicated than FPM. Moreover, their proprietary nature does not allow for a head-to-head comparison of the systems’ core elements, but it should be evident that
the features incorporated in FPM are able to capture a large percentage of the observed variance in NFL games.
1 Introduction
While American football is viewed mainly as a physical game—and it surely is—at the same
time it is probably one of the most strategic sports games, a fact that makes it appealing even to
an international crowd [1]. This has led people to analyze the game using data analytics methods and game theory. For instance, after the controversial last play call of Super
Bowl XLIX, The Economist [2] argued, using appropriate data and game theory, that this
play call was rational and not that bad after all.
PLOS ONE | DOI:10.1371/journal.pone.0168716 December 22, 2016
The ability to collect and analyze large volumes of data has, during the last few years, promoted a quantification-based approach to modeling and analyzing success in various sports. For example, pertinent to American football, Clark et al. [3] analyzed the factors that
affect the success of a field goal kick and, contrary to popular belief, did not identify any situational factor (e.g., regular vs. post season, home vs. away) as significant. In another
direction Pfitzner et al. [4] and Warner [5] studied models and systems for determining a successful betting strategy for NFL games, while the authors in [6] show that the much-discussed
off-field misconduct of NFL players does not affect a team’s performance. Furthermore, the
spatial information collected from the RFID sensors on NFL players has been used to evaluate
quarterbacks’ decision making ability [7], while efforts to assess the impact of individual offensive linemen on passing have been presented by Alamar and Weinstein-Gould [8]. Similarly,
Correia et al. [9] analyzed the passing behavior of players in rugby, the sport most similar to
American football. They found that the time required to close the gap between the first
attacker and the defense explained 64% of the variance in pass duration, which can further yield information about future pass possibilities. Nevertheless, despite the availability of
play data for American football and the proliferation of the sports analytics literature as well as
the literature surrounding the NFL, there are only a few publicly available studies that have
focused on predicting a game’s outcome. Furthermore, some of the existing models make
strong theoretical assumptions that are hard to verify (e.g., the team strength factors obeying
a first-order autoregressive process [10]). Closest to our work, Cohea and Payton developed
a logistic regression model to understand the factors affecting an NFL game outcome [11]. The
benefit of our model compared to the one presented by Cohea and Payton [11] is that it
uses far fewer explanatory variables, making it easy for a fan to follow. Most importantly, though, we combine our model with statistical bootstrap in order to
facilitate future game predictions (something that the model presented in [11] is not able to
do). Of course, predictive models for NFL games have been developed by major sports
networks. For example, ESPN has developed the Football Power Index, which is used to make
probabilistic predictions for upcoming matchups [12]. Software companies have also developed their own models (e.g., Cortana from Microsoft [13]). Nevertheless, these models are
proprietary and are not open to the public.
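To make the idea of pairing a win-probability model with statistical bootstrap concrete, the following is a minimal Python sketch. It is not the authors' implementation. The two features (total-yards and turnover differentials), the coefficient values, and the names `win_probability` and `bootstrap_matchup` are assumptions chosen purely for illustration. The sketch resamples each team's past games with replacement, feeds the resampled averages into a logistic model, and averages the resulting probabilities.

```python
import math
import random

# Hypothetical coefficients for illustration only; NOT fitted values from the paper.
COEFFS = {"intercept": 0.0, "yards_diff": 0.01, "turnover_diff": -0.8}


def win_probability(yards_diff, turnover_diff, coeffs=COEFFS):
    """Logistic model mapping box score differentials to P(win)."""
    z = (coeffs["intercept"]
         + coeffs["yards_diff"] * yards_diff
         + coeffs["turnover_diff"] * turnover_diff)
    return 1.0 / (1.0 + math.exp(-z))


def _mean_stat(games, index):
    """Average of one statistic over a list of (total_yards, turnovers) tuples."""
    return sum(game[index] for game in games) / len(games)


def bootstrap_matchup(team_a_games, team_b_games, n_resamples=1000, seed=0):
    """Estimate P(team A beats team B) by bootstrapping past game statistics.

    Each game is a (total_yards, turnovers) tuple. Every iteration resamples
    each team's games with replacement, computes the differential of the
    resampled averages, and evaluates the logistic model on it.
    """
    rng = random.Random(seed)
    probabilities = []
    for _ in range(n_resamples):
        sample_a = [rng.choice(team_a_games) for _ in team_a_games]
        sample_b = [rng.choice(team_b_games) for _ in team_b_games]
        yards_diff = _mean_stat(sample_a, 0) - _mean_stat(sample_b, 0)
        turnover_diff = _mean_stat(sample_a, 1) - _mean_stat(sample_b, 1)
        probabilities.append(win_probability(yards_diff, turnover_diff))
    return sum(probabilities) / len(probabilities)
```

For instance, `bootstrap_matchup([(400, 0), (420, 1)], [(300, 2), (280, 3)])` returns a value above 0.5 under these illustrative coefficients, because the first team gains more yards and commits fewer turnovers in every resample.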
In this study we are first interested in providing a simple model that is able to quantify the
impact of various factors on the probability of winning a game of American football. How much
does a turnover affect a team’s probability of winning? Can you really win a game after having
turned the ball over 5 times? While coaches and players know the qualitative answer to similar
questions, the goal of our work is to provide a quantitative answer. For this purpose we use
play-by-play data for the last seven seasons of the National Football League (i.e., between 2009
and 2015) and we extract specific team stat (...truncated)