A Systemic Functional Approach To Automated Authorship Analysis
Journal of Law and Policy
Volume 21
Issue 2
SYMPOSIUM:
Authorship Attribution Workshop
Article 3
2013
Shlomo Argamon, Ph.D.
Moshe Koppel, Ph.D.
Recommended Citation
Shlomo Argamon, Ph.D & Moshe Koppel, Ph.D., A Systemic Functional Approach To Automated Authorship Analysis, 21 J. L. & Pol'y
(2013).
Available at: https://brooklynworks.brooklaw.edu/jlp/vol21/iss2/3
This Article is brought to you for free and open access by the Law Journals at BrooklynWorks. It has been accepted for inclusion in Journal of Law and
Policy by an authorized editor of BrooklynWorks.
A SYSTEMIC FUNCTIONAL APPROACH TO
AUTOMATED AUTHORSHIP ANALYSIS
Shlomo Argamon* and Moshe Koppel**
INTRODUCTION
Attribution of anonymous texts, if not based on factors
external to the text (such as paper and ink type or document
provenance, as used in forensic document examination), is
largely, if not entirely, based on considerations of language
style. We will consider here the question of how to best
deconstruct a text into quantitative features for purposes of
stylistic discrimination. Two key considerations inform our
analysis. First, such features should support accurate
classification by automated methods. Second, and no less
importantly, such features should enable a clear explanation of
the differences between stylistic categories (read: authors) and
of why a disputed text appears more likely to fall into one category
than another. The latter consideration is particularly
important when a nonexpert, such as a judge or jury, must
evaluate the results and reliability of the analysis.
We start from the intuitive notion that style is indicated in a
text by those features of the text that indicate the author’s choice
of one mode of expression from among a set of equivalent
modes for a given content. There are many ways in which such
choices manifest themselves in a text. Specific words and
phrases may be chosen more frequently by certain authors than
others, such as the phrase “cool-headed logician” favored by the
Unabomber. Some authors may habitually use certain syntactic
constructions more frequently, as in Hemingway’s preference
for short, simple clauses. Differences between authors will also
arise at the level of the organization of the text as a whole, as
some people may prefer to make reasoned arguments from
evidence to conclusions, and others may prefer emotional
appeals organized differently.
* Linguistic Cognition Laboratory, Department of Computer Science, Illinois
Institute of Technology.
** Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel.
However, all of these “surface” linguistic phenomena have
multiple potential underlying causes, not only authorship. These
causes include the genre, register, and purpose of the text as well
as the educational background, social status, and personality of
the author and audience.1 What all these dimensions of variation
have in common, though, is independence, to a greater or lesser
extent, from the “topic” of the text. This explains the traditional
focus in computational authorship attribution on features such as
function word usage, vocabulary richness and complexity measures,
and frequencies of different syntactic structures, all of which are
essentially nonreferential.
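As a rough illustration of the nonreferential features named above, the following Python sketch computes relative function-word frequencies and a type-token ratio (one simple vocabulary-richness measure). The word list, the tokenizer, and all function names are assumptions made for this example; they are not the feature set developed in this article.

```python
import re
from collections import Counter

# Illustrative (far from exhaustive) list of English function words.
# This list is an assumption for the example, not taken from the article.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "but", "not")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word among all tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in vocabulary}
    return {word: counts[word] / total for word in vocabulary}


def type_token_ratio(text):
    """Distinct words divided by total words: a crude richness measure."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = "The cat sat on the mat, and the dog sat by the door."
freqs = function_word_frequencies(sample)
ttr = type_token_ratio(sample)
```

Because neither measure depends on which content words appear, both remain largely comparable across texts on different topics, which is the property the paragraph above attributes to such features.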
Early statistical attribution techniques relied on relatively
small numbers of such features, while developments in machine
learning and computational linguistics over the last fifteen to
twenty years have enabled larger numbers of features to be
generated for stylistic analysis. In almost no case, however, is
there a strong theoretical motivation behind the input feature sets
that would give the features clear interpretations in stylistic terms.
We argue that without a firm basis in a linguistic
theory of meaning (not just of syntax), we are unlikely to gain
any true insight into the nature of any stylistic distinction being
studied. Such understanding is key to both establishing and
explaining evidence for a proposed attribution. Otherwise, an
attribution method is merely a black box that may appear to
work for extrinsic or accidental reasons but not actually give
reliable results in a given case. Furthermore, an attribution
method that produces insight into the relevant language variation
is more likely to be useful and accepted in a forensic context, all
else being equal, as the judge and jury will be better able to
understand the results.
1 DOUGLAS BIBER & SUSAN CONRAD, REGISTER, GENRE, AND STYLE (P. Austin et al. eds., 2009).
We therefore sketch here a computationally tractable
formulation, which we have developed, of linguistically and
stylistically well-motivated features. This formulation permits
text classification based on specific variation in the choice of
nonreferential meanings. The
system produces meaningful information about the stylistic
distinctions being analyzed, which can be used for interpretative
and forensic purposes. We will explain our methodology and
then use it as a case study for what any such methodology
should provide.
Before we begin, it is worth briefly surveying the variety of
problems that fall under the umbrella of “authorship analysis.”
The simplest form of the problem is where an anonymous
document is potentially attributable to one of a relatively small
number (two to fifty, or so) of suspects. The question is then
simply which of the suspects has a writing style most like that of
the anonymous document. More difficult (and much more likely
in the real world) is the case where the document might not be
authored by any of the suspects at all; in this case we must be
able to determine that the document is not enough like the writing
of any of the suspects to attribute authorship. The hardest version of this
scenario is authorship verification, where the question is whether
a single suspect did or did not author the anonymous document.
All such authorship attribution scenarios assume a known set of
suspects who are being evaluated for authorship of the
questioned document. We require some quantity of texts written
by each of the suspects to determine authorship. On the other
hand, if, as is often the case in police investigations, specific
suspects are not known, we must consider the task of authorship
profiling, determining as much about the author as possible,
based upon clues in the document. As we will discuss below, a
number of personal characteristics of an author can be reliably
estimated from stylistic cues in a document. But first we will
consider generally how we can quantitatively characterize the
style of a text for computational analysis.
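The attribution scenarios surveyed above can be sketched in a few lines of Python. The sketch below assigns a questioned document to the suspect whose function-word profile is closest, and declines to attribute when even the closest suspect is too far away. The word list, the cosine distance, the threshold value of 0.2, the sample texts, and all names are illustrative assumptions, not the method described in this article; a realistic threshold would have to be calibrated on texts of known authorship.

```python
import math
import re
from collections import Counter

# Illustrative function-word list; an assumption, not the article's features.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "i", "we")


def profile(text):
    """Relative function-word frequencies as a fixed-length vector."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def attribute(questioned, known_texts, threshold=0.2):
    """Return (closest suspect, distance), or (None, distance) when the
    closest suspect is farther away than the (hypothetical) threshold."""
    q = profile(questioned)
    distances = {
        name: cosine_distance(q, profile(text))
        for name, text in known_texts.items()
    }
    best = min(distances, key=distances.get)
    if distances[best] > threshold:
        return None, distances[best]  # not enough like any suspect
    return best, distances[best]


# Invented sample texts, for illustration only.
known = {
    "A": "I think that we should go to the park and I think it is fine.",
    "B": "The report of the committee in the matter of the budget.",
}
questioned = "The findings of the panel in the case of the merger."
suspect, distance = attribute(questioned, known)
```

Passing a single suspect in `known_texts` turns the same function into a crude verification test: the answer is either that suspect or `None`. This also shows why verification is the hardest case, since everything then rests on the choice of threshold rather than on a comparison among candidates.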
I. FUNCTIONAL LEXICAL FEATURES
Our methodology is based on Halliday’s Systemic Functional
Grammar2 (“SFG”), which we find to be particularly well-suited
to the sort of computational anal (...truncated)