Am I who I say I am? Unobtrusive self-representation and personality recognition on Facebook
Margeret Hall 0 1
Simon Caton 1
☯ These authors contributed equally to this work. 1
0 School of Interdisciplinary Informatics, University of Nebraska at Omaha, Omaha, United States of America, 2 School of Computing, National College of Ireland, Dublin, Ireland
1 Editor: Feng Xia, Dalian University of Technology, CHINA
Across social media platforms users (sub)consciously represent themselves in a way which is appropriate for their intended audience. This has unknown impacts on studies with unobtrusive designs based on digital (social) platforms, and studies of contemporary social phenomena in online settings. A lack of appropriate methods to identify, control for, and mitigate the effects of self-representation, the propensity to express socially responding characteristics or self-censorship in digital settings, hinders the ability of researchers to confidently interpret and generalize their findings. This article proposes applying boosted regression modelling to fill this research gap. A case study of paid Amazon Mechanical Turk workers (n = 509) is presented where workers completed psychometric surveys and provided anonymized access to their Facebook timelines. Our research finds indicators of self-representation on Facebook, facilitating suggestions for its mitigation. We validate the use of LIWC for Facebook personality studies, as well as find discrepancies with extant literature about the use of LIWC-only approaches in unobtrusive designs. Using survey data and LIWC sentiment categories as predictors, the boosted regression model classified the Five Factor personality model with an average accuracy of 74.6%. The contribution of this work is an accurate prediction of psychometric information based on short, informal text.
Competing interests: The authors have declared
that no competing interests exist.
Across platforms like Facebook, LinkedIn, Twitter, and blogging services, users
(sub)consciously represent themselves in a way which is appropriate for their intended audience [1–5].
However, researchers have not yet adequately addressed controlling for self-representation,
the propensity to display socially responding characteristics or effects of self-censorship in
online settings [ ], including online social network platforms. The trove of potential online
social media data is vast, but the ability of researchers to identify ground truth models, and
thus to verify its authenticity, is low. This can result in misleading or wrong analyses [7–10].
As such, researchers on these platforms risk working with 'gamified' or socially responding
personas that go beyond efforts to contain Common Method Biases (CMB) in research design
[ ]. This leaves open the question of how well unobtrusively gathered online data aligns with
self-reported data. In this paper, we focus on the alignment of survey methods with
unobtrusive methods of gathering data from online social media.
This article has two aims:
· To explore the relationship between offline and online personalities via survey responses
and self-produced text, such that
· Participant-influenced biases in publicly sourced data can be mitigated.
In response to these research aims, we hypothesize that self-representation can be identified
by text-based attributes (Section 2) and describe a mechanism to do so in the context of
Facebook studies. For this, we employed the popular crowdwork platform Amazon Mechanical
Turk, receiving survey responses and anonymous Facebook Timeline data from 509 workers
(Section 3). Following on from the identification of self-representation, we discuss how it can
be controlled for in broad social models (Section 4). Section 5 then discusses the implications
of this work and summarizes its contribution and limitations, and Section 6 points out areas
for future work.
Self-representation has been discussed in several works for online and offline fora. These
studies discuss that one's tendency to truthfully disclose or censor personal information
emanates from an associated intrinsic value [13–18]. Many methods including surveys,
interviews, and (n)ethnographic research can identify self-representation from the first
person perspective. Sentiment analysis is a promising research design for the unobtrusive
identification and mitigation of self-representation bias in data at a lower overall cost [19–22].
Whilst the phenomenon of representation of self is present across all social media, Facebook lends
itself well to conducting such analyses as it is larger and has a higher upper bound of
characters per post than its major competitor Twitter [ ], and Facebook generally has set
audience boundaries [ ].
Presentation of self in online social networks. We define self-representation in
accordance with Goffman [
] as controlling or guiding the impression others could make by
altering the posters' settings, appearance and manner. Goffman's work was extended for digital
fora by [
]. Both Hogan  and boyd and colleagues [
] contend self-representation is an
increasingly frequent strategy in online participation and communication. In the view of
Goffman and Hogan self-representation is the display of the scenario-based ideal self, rather than a
pattern of deception. This view was extended by [
], who finds that self-representation online
can be for expressive, communicative, or promotional purposes. However, in contrast to the
work by Van Dijck [
], we define self-representation as distinct from the concept of identity
], where self-representation is the presentation of a scenario-based idealized
self and identity contingencies is the staging of a social identity marker (e.g., being a computer
scientist, being from the United States) in order to highlight communal (dis)similarities.
Online self-representation can be employed on social media with text typed, photos posted,
emojis used, and presence/absence of group identities (among other displayed attributes).
Self-representation is also bound to time and place. In real life one must immediately
respond to an interlocutor or opponent. In social networks, one has the option not to act
immediately. This is even true in the case of messaging platforms using delivered/read
notifications (i.e., Facebook, Whatsapp). Even these types of sites deliver notifications of messages to
the front page or screen of the interface, thus allowing the user to opt to respond at a time of
their choice. Local binding is functionally eliminated with online social networks [ ]. In
real life direct communication is often the social norm, whereas in social networks
communication is more indirect. Status updates, uploading pictures, or inserting information in
the "About Me" section is not directed to anyone specifically. Although one approximately
knows who may be reached, it is not known who will respond [
Individuals self-represent due to an increase in intrinsic value [
]. Across studies,
honesty in online representation is valued but ability and application of self-representation online
has attractive socially-reinforced benefits. Qualitative interviews (n = 100) on internet dating
found that the potential for self-representation is an attractive attribute of online activities.
A contradicting study by [ ] considered an online dating environment in order to
determine the extent of self-representation by users. Results of their interviews (n = 34) indicate
that the users who are more `honest' in self-presentation have more success in dating.
Nonetheless, all interviewees noted that in their online dating profiles they attempt to reveal
themselves particularly positively, and have the same impression of the profile construction of other
] describe self-representation as self-monitoring, defined as the construction of a
publicly presented self for social interactions in their 116-person study. [ ] define high
self-monitors as those who carefully curate their self-presentation and low self-monitors as those
who are less guarded by portraying their `real' selves. They find that high self-monitors are
more likely to occupy preferential positions and have higher social network density than low
self-monitors, measures of the relative success of a self-representation strategy and popularity
There is still open debate on the extent of self-representation online. For example, online
self-representation was challenged by [
], who find that posters describe extensions of their
actual lives in their survey and nethnography of 133 Americans and 103 Germans. In a
literature review, [
] argue that self-representation is contextual. Most people use Facebook to stay
in touch with people met offline, so they cannot completely detach their true identity [
Utz and colleagues established in their twinned studies of 255 and 198 Dutch participants that
users shorten self-descriptions to make themselves seem more interesting. When the audience
is likely to be unknown, users try to present a socially aspired self-image to be `popular' .
Emotional disclosure on Facebook. Studies show that honest self-disclosure is generally
more emphasized in real life and is different online [
] measured 185 then 37
participants in two studies, discovering that users communicate their positive emotions online more
frequently via social posturing, finding that negative emotions in Facebook are hardly
communicated. When negative (and positive) emotions are used, they tend to cluster around users
]. The intensity of positive emotion disclosure is often linked to one's
extraversion or neuroticism levels as measured on the Five Factor personality model of . Extraverts
have been found to express significantly higher frequencies of positive emotions [35–37].
Facebook's study on self-disclosure, the typing then editing, deleting, or posting of statuses
and comments from 3.9 million Facebook users, found that 71% of users self-censor in some
way. Males censor more than females, and Facebook posts are more frequently regulated than
comments. They find that those with higher boundaries (estimated by the amount of
regulations on visibility in place for a given audience member of the posting person) self-censor
more, and theorize that lack of control drives self-censorship. Given that perceived lack of
control is a characteristic of neurotic personalities [38–40], active self-censoring can be
understood as an expression of neuroticism on social media.
Linguistic Inquiry and Word Count (LIWC). This section concentrates on the
properties and related findings of the text analysis package Linguistic Inquiry and Word Count
(LIWC). This review is not extensive, and does not cover the multiple non-LIWC tools
available to measure computational affect, psychometrics, and sentiment analysis. LIWC's premise
is that it is the function and not the context of a word that matters. Latent emotional and
psychological states are revealed by word function more than by the words actually in use. Function words
comprise approximately 55% of a given language and are difficult and expensive to manipulate
[ ]. Function words can detect emotional states [42–45], predict psychometrics [ ], as
well as gender and age [ ]. LIWC has been applied to predict deception [ ], and its output
has proven to outperform humans when detecting dishonest writing samples. LIWC has
shown excellent precision and recall capacities with high but not overfitting correlations in the
analysis of latent sentiment [ ]. A number of studies discuss correlations between LIWC
and personality as well as attempt prediction tasks based on the same [35,41,48,53–55]. Until
now it has been found that machine learning approaches often perform better than
LIWC-only approaches in prediction tasks [ ].
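To make the dictionary-count idea concrete, the following minimal Python sketch computes the percentage of tokens in a text that fall into each word category. The word lists and the names `TOY_CATEGORIES` and `category_percentages` are hypothetical stand-ins chosen for illustration; they are not the LIWC dictionaries, and the tokenizer is a deliberate simplification.

```python
import re

# Hypothetical toy categories for illustration; NOT the real LIWC dictionaries.
TOY_CATEGORIES = {
    "posemo": {"happy", "love", "great"},
    "negemo": {"sad", "hate", "awful"},
    "i": {"i", "me", "my"},
}

def category_percentages(text, categories=TOY_CATEGORIES):
    """Return, per category, the percentage of tokens in `text` that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(tok in words for tok in tokens) / len(tokens)
        for name, words in categories.items()
    }

# A made-up status update with nine tokens.
scores = category_percentages("I love my great job but I hate Mondays")
```

Reporting shares of a timeline rather than raw counts keeps profiles of different lengths comparable, which is the form in which category usage is discussed in the remainder of this work.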
Recent criticisms of LIWC's fundamental approach suggest two problems: LIWC has yet to
be thoroughly validated for different mediums of online social media data [ ], and
emerging studies report low correlation strength between existing scales or survey responses and
online social media data. Comparison studies by [ ] found that LIWC and
LIWC-based dictionaries (e.g., SentiStrength) had high levels of precision, word recognition, and
agreement as well as good prediction accuracy. In general, these studies reported that
LIWC was among the top of the ranks of all tools tested for the metrics named above. This is
likely due to LIWC's focus on latent sentiment: it is more difficult to manipulate the latent
emotional function and state of a word than actual word use [ ].
Benchmark studies on personality and Facebook using LIWC. Two studies closely
match the approach of this work and are elucidated here. The initial study applying LIWC to
assess personality traits from online discourse is the work [
]. Yarkoni evaluated word usage
and personality traits of 694 bloggers using LIWC 2001's 66 categories (linguistic categories
minus non-semantic categories). He employed a correlation analysis of all LIWC categories
and the Five Factor Model with a False Discovery Rate criterion of 0.05. This work found
strong correlations across and between LIWC and the Five Factor Model. His work reports a
full feature vector of each LIWC correlation with the respective personality trait.
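A False Discovery Rate criterion is commonly applied through the Benjamini-Hochberg step-up procedure. Whether Yarkoni used exactly this procedure is an assumption made here for illustration, and the p-values below are invented. A minimal sketch:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose ordered p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is rejected.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Made-up p-values, e.g. one per category-trait correlation.
decisions = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20])
```

With many category-trait pairs tested at once, a criterion of this kind bounds the expected share of false positives among the reported correlations rather than the chance of any single false positive.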
The work [
] also considers the interaction between personality as displayed by Facebook
writing samples and LIWC. Schwartz and colleagues extracted Facebook data of 75,000
participants, analysing a corpus of 700 million words. They employed three techniques to predict the
gender, age, and personality of participants. Firstly, they employed LIWC as a stand-alone
tool. They compared this to the open vocabulary approach (a combination of words and
n-grams) and a topics-based approach. Each technique was combined to evaluate the predictive
power, using a Bonferroni correction in their evaluations. Schwartz and colleagues reported
on a word and phrase basis the indicators of personality, age and gender as compared to
Yarkoni, who reported LIWC categories. They report that gender can be predicted with between
78.4% and 91.9% accuracy. They report the explained variance but not prediction accuracy of the
Five Factor Model.
This work differs on several aspects. In contrast to the Yarkoni work, we utilize regression modelling
instead of correlations for our reporting. However, our models are built to respect
the high variable-to-predictor ratio, and thus use boosted models (see Statistical Modelling for
more details), which is a difference in approach to the work of Schwartz and colleagues.
We report word categories in the style of [ ] rather than individual words in comparison
to [ ]. We argue that by following the dictionary-label approach we aid replicability of the
study. LIWC's dictionaries are curated and updated fairly regularly, meaning that words falling
into these dictionaries will generally be recognized. By using only words and not classifiers,
researchers run the risk of particular words or phrases falling out of usage in online language.
In this case, the word-based approach would no longer be replicable.
We note that many studies exist in literature that are not analysed in depth here (see, e.g.,
[61–64]). These studies generally employ open language approaches [ ] as opposed to our
concentration on the LIWC package, or employ regression modelling without enhancements
from the machine learning domain as are employed in this work [ ].
Materials and methods
Personality as a tool to detect self-representation on Facebook
Given the status of the literature, an interesting question is raised on the unknown interaction
between personality types, posting on Facebook, and propensity for self-representation. A link
between online self-representation and real-life personality on Facebook has been definitively
addressed in neither the cyberpsychology nor the sentiment analytics literature [ ].
Personality is a good basis for the identification of self-representation due to its known
relationships in on- and offline fora [ ] and stability [ ]. Based on the findings of
[ ] we assume that personality is identifiable from online social media data, and
that these traits can be isolated with the LIWC package. H1 and H2 support that, and serve as
the expected literature-based benchmarks. H1a/b and H2a/b consider the current literature-based
discussions and further hypothesize that:
H1 Self-representation is characterized by withdrawing or enhancing psychometric
characteristics on Facebook.
H1a Positivity bias (enhanced positivity and withdrawn negativity) is a characteristic of
self-representation on Facebook.
H1b Enhanced confidence is a characteristic of self-representation on Facebook.
H2 Personality is detectable and is not mitigated via self-representation.
H2a Online self-representation cannot distort digital traces of personality that they become
H2b LIWC features detect the attributes of personality on Facebook.
This study design was reviewed by the National College of Ireland's ethics committee and
approved following a full review. The anonymized data are available under
https://doi.org/10.5281/zenodo.852652. To facilitate our study, 509 Amazon Mechanical Turk (AMT) workers
completed psychometric surveys via a Facebook application. The instruments in use are the Big Five
Inventory introduced by [ ] for personality, human flourishing as presented by [ ], and the online social media
usage survey of [ ], modified to be used for Facebook. The modified mechanisms of [ ] can
be found in the Online Appendix (S1 Text. Online Appendix to: Am I Who I Say I Am?), and are
represented as [SM#] and [HF#] henceforth. We recognize that many psychometrics exist that
could be indicative of self-representation, but the ones in use are thoroughly researched and have
strong literature-based benchmarks, and thus are the most appropriate for this analysis.
AMT has proven a reliable platform for conducting online behavioural experiments [68–71].
AMT has been found to be more representative of diversity than standard samples, and is
similar to the standard Facebook population [
]. AMT has also been used in similar research
designs where psychometrics and Facebook are simultaneously investigated [
An initial screening question based on the Instructional Manipulation Check was employed
to minimize 'click-through' behaviour [ ] and thereby increase the reliability of the
results. Payments of US$ 0.74 were issued at the end of the survey, equating to 1 cent per
question. Regardless of whether users' privacy settings allowed timeline extraction, all 509 workers
were paid upon survey completion. The study was launched over a 24-hour period to
accommodate differences in time zones.
Participants' data including IDs were automatically one-way hashed for user privacy, with
timeline, survey, and worker payment being tied to the hashed ID. This is established as a best
practice in [
]. Text-based data was automatically fed into the LIWC processing tool. A
summarized privacy statement and informed consent document were presented on the entry page
of the AMT HIT (Human Intelligence Task). A full privacy statement was available, detailing
the uses of data and steps taken to guarantee privacy in line with [
]. At no point was
identifying information available to the research team, only post-processed aggregated data [ ].
After the analysis for this paper was conducted, all data was destroyed to completely
mitigate all possibilities of de-anonymization similar to that reported in [ ] and to also ensure
that the terms and conditions of the MTurk platform were not compromised.
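The one-way hashing step can be sketched as follows. The text does not state which hash function or keying scheme was used, so the choice of HMAC with SHA-256, the function name `pseudonymize`, the key, and the example ID are all assumptions made for illustration.

```python
import hashlib
import hmac

def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Return a one-way, keyed SHA-256 digest of a platform user ID."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key and ID, for illustration only.
SECRET_KEY = b"replace-with-a-long-random-secret"
hashed_id = pseudonymize("100000123456789", SECRET_KEY)

# Survey answers, timeline text, and payment status are keyed by the hash only,
# so the raw ID never needs to be stored alongside the research data.
records = {hashed_id: {"survey": {}, "timeline": [], "paid": True}}
```

A keyed digest is used in this sketch because an unkeyed hash of a short numeric ID could be reversed by enumerating candidate IDs; the same input always maps to the same digest, which is what allows timeline, survey, and payment to be tied together.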
As participants completed the survey, a PHP-based Facebook application simultaneously
accessed and hashed their unique Facebook ID, and via Facebook's Open Graph API
(application programming interface) accessed participants' Facebook timelines for offline analysis (Fig
1). Workers were given an option to opt out of the HIT at the stage where it linked to their
Facebook profile or abandon the HIT at any other point. Privacy-aware users were able to hide
their activities from the app.
A Facebook popup screen detailed the types of data requested by the app. The app extracted
only posts, i.e., status updates, participants made to their timelines. Other post types such as
comments, shares, profile data and updates, etc. are excluded as they are not fully self-produced
texts or could be excessively identifying. While this type of constraint can create researcher bias
by potentially culling messages from the list of retrieved posts [
], we are considering the
online presentation of self. Text produced by other users or the platform does not serve the same
purpose. It is also an ethical grey zone to harvest the comments of participants' friends without
their direct consent [
We investigate the (dis)similarity between commonly applied methods for psychometric
analysis (specifically the Five Factor personality model) with a profile constructed by applying LIWC
to text data sourced from the social network platform Facebook (see Fig 2). In juxtaposing these
two profiles, we statistically analyse whether there are any relationships (latent or otherwise)
and/or predictive capabilities in the text-based profiles. Restating the general hypothesis for this
work, we expect any deviations in these profiles to be indicative of self-representation (H1).
Correspondingly, as we have a psychometric inventory for each participant to hand (via the
Five Factor personality model) we can statistically assess which components of our higher
dimensional text-based profile account for these differences (H2b). Thus, we provide
researchers with a preliminary model to redact the effects of self-representation in online platforms.
Two statistical procedures are heavily utilized in this work, namely Spearman's ρ and
Automatic Linear Modelling (SPSS Statistics version 24). In addition, a One-Way ANOVA was
performed to assess mean differences for one case and binomial regression was employed in
the case of discrete choice variables. While linear relationships exist in the data, some cases are
non-normally distributed. [
] notes that Spearman's ρ outperforms other correlation
methods in cases of contaminated normal distributions, and is robust to Type III errors (correctly
rejecting the null hypothesis for the wrong reason(s)). This justifies the use of ρ rather than
Pearson's r, in spite of the fact r tests on true values rather than ranks (thus monotonic
Fig 1. Workflow illustrating the steps to acquire, analyse, and interpret text data.
Fig 2. Model representation of regression analysis.
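The rank-based logic of Spearman's ρ can be illustrated with a short Python sketch that ranks both variables (averaging ranks for ties) and then computes Pearson's r on the ranks. The function names and the toy data are hypothetical; the analyses in this work were run in SPSS, not with this code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Toy data: a monotonic but strongly non-linear relationship with one extreme value.
logins = [1, 2, 3, 4, 5]
updates = [2, 4, 8, 16, 1000]
rho = spearman_rho(logins, updates)
r = pearson_r(logins, updates)
```

In this toy case ρ reaches its maximum because the ordering of the two variables agrees perfectly, while r is pulled below it by the single extreme value, which is the behaviour that motivates a rank-based measure for non-normally distributed data.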
Automatic Linear Modelling is a machine learning extension of regression modelling and is
employed for personality detection. Our analysis utilizes the boosted, best-subset model using
adjusted R² as the model evaluation criterion. This is consistent with data mining approaches as
suggested in [78–81]. Regression in SPSS version 24 is ruled out as it is limited to step-wise
methods only, cannot conduct an all-possible subset analysis (used here for exploratory
reasons), does not automatically identify and handle outliers, and cannot accommodate a model
with a high variable to observation ratio [
]. Automatic linear modelling is more robust
against Type I and II errors in comparison [
]. 10-fold cross validation is automatically
employed by the model [
]. It is important to note that SPSS uses cross-validation as a part
of the model building phase, therefore the individual folds have no meaning as cross-fold
validation is used as the optimisation component in boosting. This is standard in boosted
processes, as the weak learners are progressively compiled [83,84].
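The adjusted R² criterion penalizes a model for each additional predictor, which matters when many candidate variables compete for a limited number of observations. The sketch below uses the standard textbook formula; the R² value and the predictor counts are invented for illustration, and the function name is hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Illustrative numbers only: the same raw R^2 reached with few versus many predictors.
raw_r2 = 0.40
few = adjusted_r2(raw_r2, n_obs=283, n_predictors=5)
many = adjusted_r2(raw_r2, n_obs=283, n_predictors=100)
```

Under these assumed numbers the model with 100 predictors scores far lower than the model with 5, even though both explain the same share of variance before adjustment.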
A boosted model explores iteratively learning weak classifiers with respect to a distribution
by adding them to a final strong classifier [
]. When weak classifiers are added, they are
typically weighted in some way that is usually related to their accuracy [
]. After a weak learner is
added, the data is reweighted. This forces misclassified predictors to gain weight and predictors
that are classified correctly to lose weight. Thus, future weak learners focus more on the
predictors that previous weak learners misclassified [
]. This is supported by expanding the
model to a best subset approach. While computationally more intensive compared to the more
common stepwise approach that economizes on computational efforts by exploring only a
certain part of the model space, the all-possible-subsets approach conducts an intensive search of
a much larger model space by considering all possible regression models from the pool of
potential predictors . This aids prediction accuracy. Pseudo-codes for the AdaBoost
algorithm employed can be found in [
]. Outliers with a Cook's Distance smaller than one
were retained when they were observed to not have an undue influence on the data .
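The reweighting idea described above can be sketched with the classic binary AdaBoost scheme using one-feature decision stumps as weak learners. This is an illustration under stated assumptions, not the SPSS procedure used in this work, which boosts regression models; in this sketch the weights attach to training observations, and all function names and data are hypothetical.

```python
import math

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best one-feature stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Fit a weighted ensemble of stumps; labels in ys must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate stumps get more say
        ensemble.append((alpha, threshold, polarity))
        # Reweight: misclassified observations gain weight, correct ones lose weight.
        updated = []
        for w, x, y in zip(weights, xs, ys):
            pred = polarity if x >= threshold else -polarity
            updated.append(w * math.exp(-alpha * y * pred))
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
fitted = [predict(model, x) for x in xs]
```

On this toy data a single stump misclassifies two observations, whereas the three-stump ensemble fits all six, because each later stump concentrates on the cases its predecessors got wrong.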
Boosted models are popular machine learning extensions to standard regression models, and
can be employed in high-dimensional data scenarios [
]. The process of splitting the data into
training and testing sets and cross-validating it tends to guard against overfitting [
models return strong empirical results [
] for relatively small increases in computational complexity.
Most importantly, given the approach's weight on the previous fold's misclassified results, and
assessing many weak predictors in classifying results (see above paragraph), it is expected to
return highly accurate predictions [
]. As an additional step, nested 10-fold
cross-validation was employed as a mechanism to evaluate the overarching model. Although Automated
Linear Modelling employs cross validation in boosted model training, concerns about potentially
overfitting the data can still exist. Thus, by employing nested cross-validation (cross-validating a
model built using cross-validation) additional insight into the quality and performance of the
resultant model is provided. The reported error estimations are less prone to overfitting and
therefore are more adequate for model evaluation. This procedure additionally required SPSS
Modeller (version 18) as SPSS Statistics cannot accommodate nested cross-validation.
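Nested cross-validation can be sketched as two loops: an inner loop that tunes the model using only the training portion of each outer fold, and an outer loop that scores the tuned model on data it never saw. The sketch below is an illustration only; it substitutes a one-parameter ridge regression for the boosted model, uses fewer folds than the 10 reported here, and all names and data are hypothetical.

```python
def k_folds(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_ridge_slope(xs, ys, lam):
    """One-parameter ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(slope, xs, ys):
    """Mean squared prediction error of y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_error(xs, ys, lam, k):
    """Plain k-fold cross-validated error for one penalty value."""
    errors = []
    for fold in k_folds(len(xs), k):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        errors.append(mse(slope, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(errors) / len(errors)

def nested_cv(xs, ys, lambdas, outer_k=5, inner_k=4):
    """Outer loop estimates error; inner loop picks the penalty on training data only."""
    outer_errors = []
    for fold in k_folds(len(xs), outer_k):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        best_lam = min(lambdas, key=lambda lam: cv_error(tx, ty, lam, inner_k))
        slope = fit_ridge_slope(tx, ty, best_lam)
        outer_errors.append(mse(slope, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(outer_errors) / len(outer_errors)

# Toy data: y is roughly 2x with a small alternating disturbance.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]
estimate = nested_cv(xs, ys, lambdas=[0.0, 1.0, 10.0])
```

Because the held-out outer fold plays no part in choosing the penalty, the resulting error estimate is not flattered by the tuning step, which is the property that motivates the additional evaluation described above.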
In order to provide context, noteworthy descriptive statistics of the data across
demographic dimensions are provided first, then key data cleaning and transformation processes are
outlined. Subsequently, descriptives of each profile type, namely survey-based and text-based via
LIWC, are presented and discussed, before being compared with each other as well as with the findings of
[ ]. Finally, a predictive model is proposed where key LIWC categories indicative of
self-representation are discussed as a mechanism to control for self-representation.
Descriptive attributes of the population
Following standard online survey guidelines [
], participants who completed in less than
nine minutes were excluded from the analysis, as well as those with unit or item non-responses
(n = 40, or 7.9% of the sample population). Participants were nearly evenly split between the
United States and India. The largest language group was English with 285 timelines
predominately using English. 73% of participants self-reported to be aged 35 or younger. Gender of the
participants is evenly split between women and men, with one non-disclosure and one choice
of `Other.' 37% reported being unemployed and 57% completed at least a bachelor's degree.
While this does not reflect a normalized population, a younger sample with higher educational
achievements is close to the Facebook population .
Of the 285 English profiles, 283 have profiles with 50 or more words over the lifetime of the
profiles. Sensitivity analyses indicated that the 50 word threshold was the lower limit for robust
results, which is 20 words shorter than the next lowest benchmark found in IBM's Personality
Insights program with its 70-word cutoff [
]. Only the 283 English profiles with more than 50
words are used for LIWC analyses unless otherwise noted. Table 1 illustrates some descriptive
categories considering the mean, standard deviation, and median of the profiles, as well as the
frequency of words with more than six letters and words per sentence, all measures of
linguistic maturity. The average word count per worker is 9,379, just slightly over the average of [ ]
at 9,333 words per participant.
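The word-count threshold and the two linguistic maturity measures can be sketched in a few lines of Python. The text states the threshold both as "50 or more" and as "more than 50" words; the sketch assumes the inclusive reading. The tokenizer, the sentence splitter, and the function names are simplifying assumptions and do not reproduce LIWC's own processing.

```python
import re

MIN_WORDS = 50  # threshold reported in the text; inclusive reading assumed

def profile_stats(text):
    """Word count, share of words longer than six letters, and words per sentence."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "word_count": n_words,
        "pct_six_letter": 100.0 * sum(len(w) > 6 for w in words) / n_words if n_words else 0.0,
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
    }

def eligible(profiles, min_words=MIN_WORDS):
    """Keep only profiles whose text reaches the minimum word count."""
    return {
        pid: text
        for pid, text in profiles.items()
        if profile_stats(text)["word_count"] >= min_words
    }

# Made-up example text and profiles keyed by hypothetical hashed IDs.
stats = profile_stats("Wonderful weekend. Loved seeing everybody!")
kept = eligible({"a1": "word " * 50, "b2": "word " * 49})
```

In this example the five-word text has three words longer than six letters and two sentences, and only the profile that reaches 50 words is retained.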
Self-reported attributes of self-representation
There are some generally interesting results dealing with self-reported contact patterns and
motivation of use outside of self-representation issues revealed by the Spearman's ρ and
binomial regression analyses. Participants who use Facebook frequently also update their profiles
frequently (rs(337) = .292, p < .005) [SM 1/2], though those with a higher number of friends
have a negative relationship with the frequency of logins (rs(337) = -.314, p < .005) [SM 1/3]. A
negative relationship also exists between number of Facebook friends and the number of
updates (rs(337) = -.252, p < .005) [SM 2/3].
Family, and on- and offline friends are major interest areas in this sample. Participants who
use Facebook to show what they know and can do are less interested in contacting family than all
other groups (on- and offline friends, unknown people) (Exp(B) = 0.5, p = 0.071) [SM 9H/
SM4]. Those who mainly like status updates are most likely to contact family members
(Exp(B) = 2.320, p = 0.006) [SM 1D/SM4]. Participants who use Facebook in order to be recognized
by others are half as likely to have offline friends on Facebook as the rest of the population
(Exp(B) = 0.550, p = 0.085), and are twice as likely to be interested in contacting family
members on Facebook (Exp(B) = 1.989, p = 0.067) [SM 9C/4]. An exception here is those who want
recognition and support from other users: they are half as likely to contact family members
(Exp(B) = 0.406, p = 0.011) [SM 9E/4]. Men are less interested in maintaining contact with
family on Facebook than women (Exp(B) = 0.393, p = 0.001) [SM4], and those who frequently
like videos are twice as likely to use Facebook for contacting their family (Exp(B) = 2.502,
p = 0.004) [SM5/4]. Participants whose profile picture does not show their face are half as likely
to want to contact offline friends and are more interested in finding unknown online friends
(Exp(B) = 0.413, p = 0.007) [SM 11F/4], as well as participants who agree with the statement `I
can determine myself what I do or do not show others' (Exp(B) = 1.344, p = 0.033) [SM14B/4].
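As a reading aid for the Exp(B) values above: Exp(B) is the odds ratio of a binomial (logistic) regression, so B is its natural logarithm and (Exp(B) - 1) × 100 is the percentage change in the odds. The helper below is a hypothetical illustration applied to two of the values reported in this section.

```python
import math

def describe_odds_ratio(exp_b):
    """Return (B, percent change in odds) for a logistic-regression odds ratio."""
    b = math.log(exp_b)            # regression coefficient on the log-odds scale
    pct_change = (exp_b - 1.0) * 100.0
    return b, pct_change

# Values taken from the text: gender [SM4] and liking status updates [SM 1D/SM4].
b_men, pct_men = describe_odds_ratio(0.393)
b_likes, pct_likes = describe_odds_ratio(2.320)
```

Under this reading, an Exp(B) of 0.393 corresponds to odds about 61% lower and an Exp(B) of 2.320 to odds about 132% higher. Note that these are changes in odds, which coincide with changes in probability only approximately and only when the outcome is rare.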
Written attributes of self-representation on Facebook
As seen in Fig 3, participants generally communicate their positive emotions frequently (an
average of 6.16% of each timeline), where negative emotions on Facebook are hardly
communicated (2.06% of all data). This is encouraging as it is in line with LIWC standards as
established by [
]. It is also in line with the work  who name this positivity bias to be social
posturing. It must be noted that a contributing factor to this difference could be that LIWC
has been found to generally have positive polarity in its algorithm [
]. However, 60% more
words in the LIWC dictionaries are associated with negative sentiment than positive
sentiment. Given that difference, it is likely that the positivity bias in this dataset is in fact a display
of social posturing: people represent themselves to be more positive and less negative on their
Facebook profiles, an affirmation of H1a. We note that this could also be a contributing factor
to the findings of [ ].
Fig 3. Positive and negative sentiment usage across the sample population (logarithmic scale).
The analysis also looked at expressed confidence as a measure of self-representation (H1b).
This is measured by the frequency of use of first person singular and first person plural pronouns,
where people who are more confident use `I' words less than `We' words [
]. We tested the
demographic groups established in the survey with an ANOVA (Fig 4) and found a significant
difference in gender (Gender F(2,279) = 11.893, p < .0005; Wilks' Λ = .921; partial η2 = .079).
Males use more first person singular terms. We cannot reject the null hypothesis of no difference in
first person plural usage between men and women (First Person Plural (We) F(1,280) = .643, p = .423;
partial η2 = .002), whereas first person singular shows a significant difference in gendered usage
(First Person Singular (I) F(1,280) = 23.405, p < .0005; partial η2 = .077). There was
homogeneity of variance-covariance matrices, as assessed by Box's test of equality of covariance matrices (p
= .002). Males are significantly more likely to present their confidence by use of `I' words in their
online personas. Based on the findings of [
], this is an unexpected and contradictory finding.
This supports emerging findings that women express less confidence than men do, and thereby
does not support overt self-representation specific to online social networks (H1b).
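The confidence measure above can be sketched as two steps: compute per-profile rates of `I' and `We' words, then compare groups with a one-way ANOVA. The word lists below are short illustrative stand-ins and are not the LIWC dictionaries, and the function names are hypothetical.

```python
import re

# Illustrative pronoun lists only; the LIWC categories are larger.
FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
FIRST_PLURAL = {"we", "us", "our", "ours", "ourselves"}


def pronoun_rates(text):
    """Return (I-word %, We-word %) of all word tokens in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0, 0.0
    n = len(tokens)
    i_rate = 100.0 * sum(t in FIRST_SINGULAR for t in tokens) / n
    we_rate = 100.0 * sum(t in FIRST_PLURAL for t in tokens) / n
    return i_rate, we_rate


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


print(pronoun_rates("I think we like my dog"))
print(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]))
```

The reported degrees of freedom (1, 280) are what this calculation gives for two groups totalling 282 profiles.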
Detecting personality from online responses and online discourse
In order to mitigate self-representation, the attributes indicating personality must first be
addressed. This section discusses the predictors of the variables with the strongest predictive
coefficients from the entire list of possible 136 variables (survey items and LIWC categories)
Fig 4. Gendered usage of confidence-expressing statements on Facebook profiles.
and also introduces models with only data that would be available from Facebook profiles (the
LIWC categories) to define the relationships between LIWC and psychometrics (Tables 2-6)
(H2b). Applying the data mining technique referred to in the Methodology section (refer to Fig
2 for the model representation), we regress each of the five personality traits of the Five Factor
model on the 136 variables of survey responses and LIWC categories, then on the 80 variables
representing LIWC categories alone. It is worth noting that the same process was completed for the
prediction of human flourishing. The correlations of extraversion and neuroticism to
well-being are strong enough ([rs(282) = .357, p < .0005] / [rs(282) = -.263, p < .0005]) that further
analyses are precluded. We introduce these attributes as personality vectors (Tables 2-6). Tables
7 and 8 display the prediction accuracy and explained variance, as well as the nested
cross-validation of these values, for the five traits considering all 136 variables.
Openness has a prediction accuracy of 65.0% and an explained variance of 47.2%.
Significant at the 0.001 level for openness are the survey categories meaning [HF 4], self-esteem
[HF 9], engagement [HF 3], competence [HF 1], optimism [HF 5], positive emotion [HF 6],
and resilience [HF 8]; the country of origin of the worker; and the LIWC category `feelings.'
With a prediction accuracy of 66.7% and an R2 (explained variance) of 43.3%,
conscientiousness is described by the largest collection of LIWC categories of all five traits (Table 3). This
could be an indication of the nuance of this particular trait's expression in online dialogue.
Perhaps unsurprisingly, the strongest predictor of this trait is the LIWC category Assent.
The most relevant predictors are the LIWC categories `friends', `down', and `fillers'; the survey
responses `a profile picture that is not obviously me' [SM 11F], number of friends [SM 3], `I
understand quickly how others perceive me' [SM 14A], assent to `People should present
themselves on online social networks as the same person as they are offline' [SM 8], and using
Facebook to give and get information [SM 9K]; and the survey measurements resilience [HF 8] and
positive relationships [HF 7].
Extraversion, with 77.9% accuracy and an R2 of 56.1%, is related to the survey items competence
[HF 1], self-esteem [HF 9], meaning [HF 4], optimism [HF 5], positive emotion [HF 6], vitality
[HF 10], and resilience [HF 8]; country of origin; and the survey responses `I understand
quickly how I am perceived by others' [SM 14A] and managing Facebook profiles with
displays of albums [SM 11G].
Interestingly, those scoring high in Extraversion have a positive usage of words displaying
Anger but withdrawn usage of words conveying Negative Emotions. Extroverts also use `We'
words (first person plural) more than the other traits, which could be a display of withdrawn
confidence as expressed online (Table 4).
Agreeableness has an accuracy of 63.5% and 46.3% explained variance indicating high
reliability. Highly significant are the survey items resilience [HF 8], meaning [HF 4], self-esteem [HF
9], and competence [HF 1]; country of origin; the LIWC categories `friends', `inhibition',
`feelings', and `assent'; and declination of `I can be who or what I want on my Profile page'
[SM 14D]. Unexpectedly those scoring high on this trait reflect withdrawn usage of Positive
Emotion (Table 5). They score highest of all traits in attributes capturing linguistic maturity
(Unique Words, Words per Sentence).
Neuroticism has a good performance (70.8% accuracy) and reasonable R2 (49.9%). The most
significant survey items are resilience [HF 8], self-esteem [HF 9], emotional stability [HF 2],
vitality [HF 10], and optimism [HF 5]; using Facebook to spy on others [SM 9D], managing
presentation of self with pictures not of them [SM 11F], using Facebook to observe other
people [SM 9F], and liking videos on Facebook [SM 5]. Finally, the LIWC category `feelings' is
highly significant. Table 6 displays an interaction between positive usage of Personal
Achievement and a withdrawn usage of References to Others; this could indicate that the discourse of
those high in neuroticism tends towards self-centredness.
Model performance considering benchmark works and implications
Workers' self-produced text is indicative of self-representation when compared to their
responses to the Five Factor model (H2). The Automated Linear Modelling approach in SPSS
creates meritorious model fits averaging 68.8% reference model accuracy and 48.6% explained
variance as seen in Table 7, without overt signs of data overfitting (H2a).
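As a quick arithmetic check, the averages quoted above can be reproduced from the per-trait accuracy and explained variance values reported in the preceding section. The variable names are illustrative.

```python
# Per-trait values as reported in the text (percentages).
accuracy = {
    "Openness": 65.0,
    "Conscientiousness": 66.7,
    "Extraversion": 77.9,
    "Agreeableness": 63.5,
    "Neuroticism": 70.8,
}
explained_variance = {
    "Openness": 47.2,
    "Conscientiousness": 43.3,
    "Extraversion": 56.1,
    "Agreeableness": 46.3,
    "Neuroticism": 49.9,
}

mean_accuracy = sum(accuracy.values()) / len(accuracy)
mean_r2 = sum(explained_variance.values()) / len(explained_variance)

print(f"mean accuracy: {mean_accuracy:.1f}%")
print(f"mean explained variance: {mean_r2:.1f}%")
print(f"explained variance range: {min(explained_variance.values())}% "
      f"to {max(explained_variance.values())}%")
```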
[Table residue; surviving column headers: Nested CV Mean Linear; Schwartz et al. R2; Schwartz et al. R2 (LIWC combined with topics and words)]
Considering sizeable correlations between predictor groups, the unique variance explained
by each of the variables indexed by the squared semipartial correlations is low. In no case was
there an instance of Cook's Distance larger than one, so all outliers were handled within the
data rather than trimmed [
]. The multivariate models are statistically significant for each
personality trait (p < .05).
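The Cook's Distance check mentioned above can be illustrated for the simplest case of one predictor. This is a sketch under that simplifying assumption; the study's models are multivariate and were fitted in SPSS, and the data below are invented for illustration.

```python
def cooks_distances(xs, ys):
    """Cook's distance per observation for a simple linear regression y = a + b*x."""
    n = len(xs)
    p = 2  # number of fitted parameters: intercept and slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    distances = []
    for x, e in zip(xs, residuals):
        leverage = 1.0 / n + (x - mean_x) ** 2 / sxx
        distances.append((e * e / (p * s2)) * leverage / (1.0 - leverage) ** 2)
    return distances


# Invented data: the last point sits far from the others and pulls the fit.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
d = cooks_distances(xs, ys)
flagged = [i for i, value in enumerate(d) if value > 1.0]
print(flagged)
```

Observations with a distance above one are the ones conventionally inspected as influential; the text reports that none occurred in the study's models.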
When nested cross-validation is additionally performed we see an average result of 0.67
(Table 7, Table 8). While the average of the model is nearly the same, indicating goodness of
the approach, there are fluctuations found in the individual constructs (Minimum and
Maximum columns, Table 8). The fluctuations in the results are assumed to be a function of the
program in use: when a linear model in SPSS encounters a testing instance with a
value it had not anticipated (e.g. an attribute value outside the range of the training data
provided), SPSS generally predicts $null$. Table 8 compares the minimum, maximum and average
performance of nested cross-validation across the five constructs and compares the results
with those of Table 7. Per-fold results are included as Supporting Information (S1 Table.
Supporting Information Per-fold performance testing).
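The idea of nested cross-validation, with minimum, mean and maximum performance reported across outer folds, can be sketched as follows. This is not the SPSS procedure used in the study: it assumes a one-predictor ridge regression, scores folds by mean squared error rather than linear correlation, and uses invented data and hypothetical function names.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """One-predictor ridge regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]


def subset(values, indices):
    return [values[i] for i in indices]


def nested_cv(xs, ys, lambdas, outer_k=5, inner_k=4, seed=0):
    """Return (min, mean, max) outer-fold MSE; lam is tuned on inner folds only."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    outer_scores = []
    for test_idx in folds(idx, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]

        def inner_error(lam):
            errors = []
            for val_idx in folds(train_idx, inner_k):
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                model = fit_ridge_1d(subset(xs, fit_idx), subset(ys, fit_idx), lam)
                errors.append(mse(model, subset(xs, val_idx), subset(ys, val_idx)))
            return sum(errors) / len(errors)

        best_lam = min(lambdas, key=inner_error)
        model = fit_ridge_1d(subset(xs, train_idx), subset(ys, train_idx), best_lam)
        outer_scores.append(mse(model, subset(xs, test_idx), subset(ys, test_idx)))
    return min(outer_scores), sum(outer_scores) / len(outer_scores), max(outer_scores)


# Invented data: a noisy straight line.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + 0.1 * ((i % 3) - 1) for i, x in enumerate(xs)]
lo, mean, hi = nested_cv(xs, ys, lambdas=[0.0, 1.0, 100.0])
print(lo, mean, hi)
```

Because the penalty is chosen using only the training portion of each outer fold, the outer-fold scores are not contaminated by model selection, which is the property that makes the per-fold spread informative.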
Our models have three major differentiators with the works of [
]. First, we find fewer
categories which are significant at the 0.05 or above level per personality trait (see Tables 2-6)
as compared to . We see the reduced dimensionality as a strength of our approach. It
indicates that the representation of the five traits is more compact than in the benchmark works
], and is likely more generalizable. Second, the strength of the coefficients in our model
is considerably higher than the LIWC-only results reported in . This implies that our
method has competitive prediction accuracy while utilizing fewer features with stronger
statistical power. Finally, an advantage of our approach (boosted, best-subset regression modelling)
is the superior performance considering explained variance. The reported explained variance
of the LIWC-only approach in [
] with a standard regression model reached an average of
26%, and 35% when combined with other features. Our approach achieves an explained
variance between 43% and 56%. Given this work's near-replication of the psychometric instruments as
well as known relationships between them (e.g., well-being, extroversion, and neuroticism),
our reported difference is unlikely to be solely due to differences in sample size. This suggests
that while other approaches (e.g., latent semantic analysis [
], open-word approaches [
or correlation studies ) are meritorious, LIWC-only approaches when combined with
machine learning extensions are also appropriate for the task. Indeed, the performance
increase in comparison with standard linear models and other linguistic approaches suggests
that future research should consider employing such (relatively light) machine learning
approaches for more accurate, reliable results.
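To illustrate what boosting adds to a regression model, the sketch below fits a sequence of small models to the residuals of the ensemble built so far. It is a generic gradient-boosting example with one-split regression stumps and squared error, assumed here for illustration; it is not the boosted, best-subset procedure in SPSS that the study used, and the data are invented.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one predictor: (threshold, left_mean, right_mean)."""
    best = None
    cuts = sorted(set(xs))
    for low, high in zip(cuts, cuts[1:]):
        threshold = (low + high) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1:]


def boost(xs, ys, rounds=100, learning_rate=0.1):
    """Fit stumps sequentially to the current residuals; returns the ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)


def training_mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented non-linear data that a single straight line fits poorly.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model = boost(xs, ys, rounds=200)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
print(baseline_mse, training_mse(model, xs, ys))
```

The training error shown here only demonstrates the mechanism; any claim about predictive accuracy would need held-out evaluation such as the nested cross-validation reported in the text.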
Implication: Personality is a tool for mitigating self-representation
Having established a compact representation of the five personality traits of [
] detected from
LIWC data as it represents Facebook data, researchers can use the results reported in this work
as personality vectors. Personality vectors in this case are the collated LIWC categories reported
in Tables 2±6. Researchers may apply the vectors to Facebook-based data when investigating
psychometrics in order to represent a more realistic view of the subject. This contributes a
method for social researchers to verify psychometric baselines of subjects. Having done this,
researchers are able to mitigate the effects of socially responding personas in online social
media data. This delivers a closer representation to the real personality of the subject than is
Discussion and conclusion
The key findings of this research are that self-representation in online social media is an
identifiable phenomenon, that self-representation can be isolated, and a smaller number of
indicators than previously reported can be used to do so. Moreover, it opens an interesting
discussion on the impact of self-representation on social media analyses, both from the
perspective of the researcher validating social models, and the subject with respect to the intent
of such behaviours. To our knowledge this is the first work that validates applying
LIWC to Facebook as a stand-alone tool for the identification of personality traits and
self-representation. Similar studies have validated other text inputs (e.g., [
]), or have approached feature
creation from an individual word basis [
]. Finally, the accuracy of our results was
aided by employing a machine learning extension to the regression model (boosted
regression modelling), increasing accuracy dramatically.
Self-representation was identified in a number of indicators. Positive affectivity and
withdrawn negative emotions are identifiable across the workers' profiles. Withdrawn negative
affect is particularly indicative of self-representation (H1a). However, confidence follows
expected patterns across genders (H1b). Male participants appear more confident in their
written profiles than females. As this is a finding in emergent literature, this cannot be understood
as an overt measure of self-representation. Personality is still detectable even when
self-representation is present (H2a), and LIWC-only features have meritorious performance in
comparison to latent semantic methods like the open vocabulary approach of [
] (H2b). Our reported
accuracies were enabled by creating a fitting model for personality prediction as opposed to
off-the-shelf prediction models.
The stated aims of this research are twofold: establishing the relationship between offline
and online personalities, and using that relationship to mitigate self-representation biases in
publicly sourced data. In accomplishing these goals, this research creates a generally applicable
method for the design of cross-disciplinary methods and the analysis of social media data. Such a method is impactful in both
research arenas and commercial domains, in that it allows the study designer to approximate
participant baselines without highly intrusive mechanisms. A strength of this study is its
consideration and application of the findings from recent cyberpsychology literature.
In a systematic manner, this research detailed the experimental design, data collection, and
analysis. Common method biases are addressed and appropriately eliminated when identified.
The method allows for replication by careful detailing of the steps, (pre)processing of data and
models built. A major contribution is addressing method biases in the harvesting and analysis
of social media data. This research utilizes the entire data stream as posted by the individual
per profile, mitigating sampling errors. It also names common markers of the phenomenon of
self-representation based on simple LIWC categories and psychometrics that allow researchers
to mitigate its effects in future research. With personality and mood validated and a sentiment
analysis performed on the lifespan of a user's Facebook timeline, we can now measure the
propensity of a user to portray themselves in opposition to their truthful, psychological baseline.
We propose that researchers can apply this method of personality isolation to their analyses
of publicly sourced data in order to mitigate the effects of self-representation. This supports
the goal of (Big) data-driven personality research being both precise and accurate. Such an
approach has diverse applications in that it allows for a new personality-based estimator from
which to deduce generalizations from publicly accessible text onto the general population.
With self-representation identified and removed, a valid measurement of psychometrics
without necessitating expensive surveys or interviews is created.
Limitations, future work
A limitation is the sample size, which disallows larger statements about linguistic subgroups;
the non-English samples are too small for meaningful statistics. While larger than similar
cyberpsychology studies found in the related work in terms of both participant number and
volume of text, the study is still smaller than the largest Facebook studies to date [
Another drawback is that the results are tailored to Facebook; the findings of this study are
unlikely to generalize to professional networking, microblogs, or visual media sites. A
concluding remark on limitations is related to privacy. While the study obtained informed consent of
its workers, it remains an open question whether workers truly understood the amount of information
that was being given in the task.
Extensions of this research are closely linked to its limitations. Cross-platform analysis of
the same user for their various public profiles would give future work a more nuanced view of
the ways that social media users self-represent to different audiences. Such a work would fill
research gaps in `best' platform usage for information disbursement, creation, and influence,
as well as impact for a given network. A network analysis of users and the resulting textured
understanding of how users cluster and complement within a network would be a good area of
future research. Such an approach would also support answering the questions of why social
media users self-represent in the way they do, given a particular site.
S1 Text. Online appendix to: Am I Who I Say I Am?
S1 Table. Supporting information per-fold performance testing. Model ID O - Openness;
Model ID C - Conscientiousness; Model ID E - Extraversion; Model ID A - Agreeableness;
Model ID N - Neuroticism.
Conceptualization: Margeret Hall, Simon Caton.
Data curation: Margeret Hall.
Formal analysis: Margeret Hall.
Funding acquisition: Margeret Hall, Simon Caton.
Investigation: Margeret Hall.
Methodology: Margeret Hall, Simon Caton.
Project administration: Margeret Hall.
Resources: Margeret Hall, Simon Caton.
Software: Simon Caton.
Supervision: Margeret Hall, Simon Caton.
Validation: Margeret Hall.
Visualization: Margeret Hall.
Writing - original draft: Margeret Hall, Simon Caton.
Writing - review & editing: Margeret Hall, Simon Caton.
Warshaw J, Matthews T, Whittaker S, Kau C, Bengualid M, Smith B a. Can an Algorithm Know the "Real You"? Understanding People's Reactions to Hyper-personal Analytics Systems. Proc 33rd Annu ACM Conf Hum Factors Comput Syst. 2015; 797-806. https://doi.org/10.1145/2702123.2702274
84. IBM. IBM SPSS Regression 22. 2011.
1. Qiu L, Lin H, Leung AK, Tov W. Putting their best foot forward: emotional disclosure on Facebook. Cyberpsychol Behav Soc Netw. 2012;15: 569-72. https://doi.org/10.1089/cyber.2012.0200 PMID: 22924675
2. Zhao S, Grasmuck S, Martin J. Identity construction on Facebook: Digital empowerment in anchored relationships. Comput Human Behav. 2008;24: 1816-1836. https://doi.org/10.1016/j.chb.2008.02.012
3. Hogan B. The Presentation of Self in the Age of Social Media: Distinguishing Performances and Exhibitions Online. Bull Sci Technol Soc. 2010;30: 377-386. https://doi.org/10.1177/0270467610385893
4. van Dijck J. "You have one identity": performing the self on Facebook and LinkedIn. Media, Cult Soc. 2013;35: 199-215. https://doi.org/10.1177/0163443712468605
5. Boyd D, Chang M, Goodman E. Representations of Digital Identity. CSCW'04. 2004;6: 6-10. Available: http://www.danah.org/papers/CSCW2004Workshop.pdf
6. Das S, Kramer A. Self-Censorship on Facebook. Seventh International AAAI Conference on Weblogs and Social Media. Cambridge, USA; 2013. pp. 120-127. https://doi.org/10.1007/b104039
7. Jungherr A, Jürgens P, Schön H. Why the Pirate Party Won the German Election of 2009 or The Trouble With Predictions. Soc Sci Comput Rev. 2011;30: 229-234. https://doi.org/10.1177/0894439311404119
8. Rost M, Barkhuus L, Cramer H, Brown B. Representation and Communication: Challenges in Interpreting Large Social Media Datasets. CSCW'13. San Antonio, TX: ACM Press; 2013. pp. 357-362.
9. Chung J, Mustafaraj E. Can collective sentiment expressed on twitter predict political elections? Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence. San Francisco, CA; 2011. pp. 1770-1771. https://doi.org/10.1007/s00247-002-0848-7
10. Boyd RL, Pennebaker JW. Language-based personality: a new approach to personality in a digital world. Curr Opin Behav Sci. 2017;18: 63-68. https://doi.org/10.1016/j.cobeha.2017.07.017
11. Podsakoff PM, MacKenzie SB, Podsakoff NP. Sources of method bias in social science research and recommendations on how to control it. Annu Rev Psychol. 2012;63: 539-69. https://doi.org/10.1146/annurev-psych-120710-100452 PMID: 21838546
12. Podsakoff PM, Mackenzie SB, Lee J, Podsakoff NP. Common Method Biases in Behavioral Research: A Critical Review of the Literature and Recommended Remedies. J Appl Psychol. 2003;88: 879-903. https://doi.org/10.1037/0021-9010.88.5.879 PMID: 14516251
13. Ellison N, Heino R, Gibbs J. Managing Impressions Online: Self-Presentation Processes in the Online Dating Environment. J Comput Commun. 2006;11: 415-441. https://doi.org/10.1111/j.1083-6101.2006.00020.x
14. Lawson HM, Leck K. Dynamics of Internet Dating. Soc Sci Comput Rev. 2006;24: 189-208. https://doi.org/10.1177/0894439305283402
15. Lingel J, Naaman M, boyd danah. City, self, network: transnational migrants and online identity work. CSCW'14. 2014. pp. 1502-1510. https://doi.org/10.1145/2531602.25311693
16. Tamir DI, Mitchell JP. Disclosing information about the self is intrinsically rewarding. Proc Natl Acad Sci U S A. 2012;109: 8038-43. https://doi.org/10.1073/pnas.1202129109 PMID: 22566617
17. Back MD, Stopfer JM, Vazire S, Gaddis S, Schmukle SC, Egloff B, et al. Facebook profiles reflect actual personality, not self-idealization. Psychol Sci. 2010;21: 372-4. https://doi.org/10.1177/0956797609360756 PMID: 20424071
18. Hilsen AI, Helvik T. The construction of self in social medias, such as Facebook. AI Soc. 2012;29: 3-10. https://doi.org/10.1007/s00146-012-0426-y
19. Lin H, Qiu L. Two sites, two voices: Linguistic differences between facebook status updates and tweets. Rau PLP, editor. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). LNCS 8024; 2013;8024 LNCS: 432-440. https://doi.org/10.1007/978-3-642-39137-8-48
20. Pennebaker J, King L. Linguistic Styles: Language Use as an Individual. J Pers Soc Psychol. 1999;77: 1296-1312. PMID: 10626371
21. Gonzales AL, Hancock JT, Pennebaker J. Language Style Matching as a Predictor of Social Dynamics in Small Groups. Communic Res. 2010;37: 3-19. https://doi.org/10.1177/0093650209351468
22. Groom CJ, Pennebaker J. Words. J Res Pers. 2002;36: 615-621. https://doi.org/10.1016/S0092-6566(02)00512-3
23. Duggan M, Ellison N, Lampe C, Lenhart A, Madden M. Pew Social Media Report 2015 [Internet]. 2014. Available: http://www.pewinternet.org/2015/01/09/social-media-update-2014/
24. Wilson RE, Gosling SD, Graham LT. A Review of Facebook Research in the Social Sciences. Perspect Psychol Sci. 2012;7: 203-220. https://doi.org/10.1177/1745691612442904 PMID: 26168459
25. Goffman E. The Presentation of Self In Everyday Life. 1st ed. New York, New York, USA: Anchor; 1959.
26. Purdie-Vaughns V, Steele CM, Davies PG, Ditlmann R, Crosby JR. Social identity contingencies: how diversity cues signal threat or safety for African Americans in mainstream institutions. J Pers Soc Psychol. 2008;94: 615-30. https://doi.org/10.1037/0022-3514.94.4.615 PMID: 18361675
27. Hoever A. Strategien und Konzepte der Selbstdarstellung auf Social Network Services am Beispiel Facebook. Berlin: Berliner Methodentreffen Qualitative Forschung; 2010.
28. Mehra A, Kilduff M, Brass DJ. The social networks of high and low self-monitors: Implications for workplace performance. Adm Sci Q. 2001;46: 121-146.
29. Utz S, Tanis M, Vermeulen I. It is all about being popular: the effects of need for popularity on social network site use. Cyberpsychol Behav Soc Netw. 2012;15: 37-42. https://doi.org/10.1089/cyber.2010.0651 PMID: 21988765
30. Gosling SD, Mason W. Internet Research in Psychology. Annu Rev Psychol. 2015;66: 877-902. https://doi.org/10.1146/annurev-psych-010814-015321 PMID: 25251483
31. Bazarova N, Taft J, Choi YyH, Cosley D. Managing Impressions and Relationships on Facebook: Self-Presentational and Relational Concerns Revealed Through the Analysis of Language Style. J Lang Soc Psychol. 2012; https://doi.org/10.1177/0261927X12456384
32. Bollen J, Gonçalves B, Ruan G, Mao H. Happiness is assortative in online social networks. Artif Life. 2011;17: 237-51. https://doi.org/10.1162/artl_a_00034 PMID: 21554117
33. Fowler J, Christakis N. Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ. 2008;337: a2338. https://doi.org/10.1136/bmj.a2338 PMID: 19056788
34. John OP, Donahue EM, Kentle RL. The big five inventory - versions 4a and 54. Berkeley, USA; 1991.
35. Yarkoni T. Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. J Res Pers. 2010;44: 363-373. https://doi.org/10.1016/j.jrp.2010.04.001 PMID: 20563301
36. Hall M, Kimbrough SO, Haas C, Weinhardt C, Caton S. Towards the gamification of well-being measures. 2012 IEEE 8th International Conference on E-Science, e-Science 2012. IEEE; 2012. pp. 1-8. https://doi.org/10.1109/eScience.2012.6404457
37. Hall M, Caton S, Weinhardt C. Well-being's Predictive Value. In: Ozok AA, Zaphiris P, editors. Proceedings of the 15th International Conference on Human-Computer Interaction (HCII). Berlin: LNCS, Springer Verlag; 2013. pp. 13-22. https://doi.org/10.1007/978-3-642-39371-6-2
38. DeNeve KM, Cooper H. The happy personality: a meta-analysis of 137 personality traits and subjective well-being. Psychol Bull. 1998;124: 197-229. PMID: 9747186
40. John O, Naumann L, Soto C. Paradigm Shift to the Integrative Big Five Trait Taxonomy. Handbook of Personality. 2008. pp. 114-158. https://doi.org/10.1016/S0191-8869(97)81000-8
41. Tausczik Y, Pennebaker J. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. J Lang Soc Psychol. 2010;29: 24-54. https://doi.org/10.1177/0261927X09351676
42. Kramer A. An Unobtrusive Behavioral Model of "Gross National Happiness." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Atlanta, USA; 2010. pp. 287-290. https://doi.org/10.1145/1753326.1753369
43. Kramer A. The spread of emotion via facebook. Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems - CHI '12. New York, New York, USA: ACM Press; 2012. pp. 767-770. https://doi.org/10.1145/2207676.2207787
44. Kramer A, Guillory JE, Hancock J. Experimental evidence of massive-scale emotional contagion through social networks. Proc Natl Acad Sci. 2014;111: 8788-8790. https://doi.org/10.1073/pnas.1320040111 PMID: 24889601
45. Lindner A, Hall M, Niemeyer C, Caton S. BeWell: A Sentiment Aggregator for Proactive Community Management. CHI'15 Extended Abstracts. Seoul, Korea: ACM Press; 2015. pp. 1055-1060. http://dx.doi.org/10.1145/2702613.2732787
46. Chung C, Pennebaker J. Counting little words in Big Data: The Psychology of Communities, Culture, and History. In: Forgas J, Vincze O, Laszlo J, editors. Social Cognition and Communication. New York, New York, USA: Psychology Press; 2014. pp. 25-42.
47. Campbell RS, Pennebaker J. The secret life of pronouns: Flexibility in writing style and physical health. Psychol Sci. 2003;14: 60-65. https://doi.org/10.1111/1467-9280.01419 PMID: 12564755
48. Schwartz HA, Eichstaedt J, Kern M, Dziurzynski L, Ramones S, Agrawal M, et al. Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS One. 2013;8: e73791. https://doi.org/10.1371/journal.pone.0073791 PMID: 24086296
49. Ott M, Choi Y, Cardie C, Hancock J. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. 2011. pp. 309-319.
50. Newman M, Pennebaker J, Berry D, Richards J. Lying Words: Predicting Deception From Linguistic Styles. Personal Soc Psychol Bull. 2003;29: 665-675. https://doi.org/10.1177/0146167203251529
51. Salas-Zárate M del P, López-López E, Valencia-García R, Aussenac-gilles N, Almela Á, Alor-Hernández G. A study on LIWC categories for opinion mining in Spanish reviews. J Inf Sci. 2014;1: 1-13. https://doi.org/10.1177/0165551510000000
52. Mahmud J. Why Do You Write This? Prediction of Influencers from Word Use Psycholinguistic Analysis from text. ICWSM. Ann Arbor, USA; 2014. pp. 603-606.
53. Markovikj D, Gievska S. Mining Facebook Data for Predictive Personality Modeling. Proc of WCPR13, in . . .. 2013. pp. 23-26. Available: http://clic.cimec.unitn.it/fabio/wcpr13/markovikj_wcpr13.pdf
54. Farnadi G, Zoghbi S, Moens M, Cock M De. Recognising Personality Traits Using Facebook Status Updates. Work Comput Personal Recognit Int AAAI Conf weblogs Soc media. 2013; 14-18. Available: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/viewPDFInterstitial/6245/6309
55. Komisin M, Guinn C. Identifying Personality Types Using Document Classification Methods. Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference. Palo Alto, USA; 2012. pp. 232-237.
56. Balahur A, Hermida JM. Extending the EmotiNet Knowledge Base to Improve the Automatic Detection of Implicitly Expressed Emotions from Text. LREC. Istanbul, Turkey; 2012. pp. 1207-1214.
57. Beasley A, Mason W. Emotional States vs. Emotional Words in Social Media. Proceedings of ACM WebSci'15. Oxford, England: ACM Press; 2015. https://doi.org/10.1145/2786451.2786473
58. Gonçalves P, Araújo M, Benevenuto F, Cha M. Comparing and combining sentiment analysis methods. Proc first ACM Conf Online Soc networks - COSN '13. 2013; 27-38. https://doi.org/10.1145/2512938.2512951
59. Araújo M, Gonçalves P, Cha M, Benevenuto F. iFeel: A Web System that Compares and Combines Sentiment Analysis Methods. International World Wide Web Conference Committee (IW3C2). 2014. doi:http://dx.doi.org/10.1145/2567948.2577013.
60. Caton S, Hall M, Weinhardt C. How do politicians use Facebook? An applied Social Observatory. Big Data Soc. SAGE Publications; 2015;2: 2053951715612822. https://doi.org/10.1177/2053951715612822
61. Park G, Schwartz HA, Eichsteadt JC, Kern ML, Kosinski M, Stillwell DJ, et al. Automatic personality assessment through social media language. J Pers Soc Psychol. 2015;108: 1-25. https://doi.org/10.1037/pspp0000020
62. Lambiotte R, Kosinski M. Tracking the Digital Footprints of Personality. Proc IEEE. 2014;102: 1934-1939. https://doi.org/10.1109/JPROC.2014.2359054
63. Wang N, Kosinski M, Stillwell D, Rust J. Can Happiness be Measured using Facebook status updates? 2010;
64. Youyou W, Kosinski M, Stillwell D. Computer-based personality judgments are more accurate than those made by humans. Proc Natl Acad Sci. 2015; https://doi.org/10.1073/pnas.1418680112
65. Hall M, Glanz S, Caton S, Weinhardt C. Measuring Your Best You: A Gamification Framework for Wellbeing Measurement. Third International Conference on Social Computing and its Applications. Karlsruhe, Germany: IEEE; 2013. pp. 277-282. https://doi.org/doi:10.1109/CGC.2013.51
66. Huppert F, So TTC. Flourishing Across Europe: Application of a New Conceptual Framework for Defining Well-Being. Soc Indic Res. 2013;110: 837-861. https://doi.org/10.1007/s11205-011-9966-7 PMID: 23329863
67. Ewig C. Social Media: Theorie und Praxis digitaler Sozialität / Social media: theory and practice of digital sociality. In: Anastasiadis M, Thimm C, editors. Social Media: Theorie und Praxis digitaler Sozialität. Frankfurt am Main: Peter Lang Internationaler Verlag der Wissenschaften; 2011.
68. Berinsky AJ, Huber G, Lenz GS. Evaluating Online Labor Markets for Experimental Research: Amazon.com's Mechanical Turk. Polit Anal. 2012;20: 351-368. https://doi.org/10.1093/pan/mpr057
69. Paolacci G, Chandler J, Ipeirotis P. Running experiments on Amazon Mechanical Turk. Judgm Decis Mak. 2010;5: 411-419.
70. Ross J, Zaldivar A, Irani L, Tomlinson B. Who are the Turkers? Worker Demographics in Amazon Mechanical Turk. CHI 2010. 2010. pp. 2863-2872.
71. Mason W, Suri S. Conducting behavioral research on Amazon's Mechanical Turk. Behav Res Methods. 2012;44: 1-23. https://doi.org/10.3758/s13428-011-0124-6 PMID: 21717266
72. Yearwood MH , Cuddy A , Lamba N , Youyou W , van der Lowe I , Piff PK , et al. On wealth and the diversity of friendships: High social class people around the world have fewer international friends . Pers Individ Dif . 2015 ; 87 : 224 ± 229 . https://doi.org/10.1016/j.paid. 2015 . 07 .040
73. Lease M , Hullman J , Bigham JP , Bernstein MS , Kim J , Lasecki W , et al. Mechanical turk is not anonymous . Soc Sci Res Network . 2013 ; 15 . doi:http://dx.doi.org/10.2139/ssrn.2228728
74. Clifford S , Jewell RM , Waggoner PD . Are samples drawn from Mechanical Turk valid for research on political ideology? Res Polit . 2015 ; 2: 1±9 . https://doi.org/10.1177/2053168015622072
75. Zimmer M. ªBut the data is already publicº: on the ethics of research in Facebook . Ethics Inf Technol . 2010 ; 12 : 313 ± 325 . https://doi.org/10.1007/s10676-010-9227-5
76. GonzaÂlez-BailoÂn S , Wang N , Rivero A , Borge-Holthoefer J . Assessing the bias in samples of large online networks . Soc Networks . Elsevier B.V. ; 2014 ; 38 : 16 ± 27 . https://doi.org/10.1016/j.socnet. 2014 . 01 .004
77. Fowler RL . Power and Robustness in Product-Moment Correlation . Appl Psychol Meas . 1987 ; 11 : 419 ± 428 .
78. Schonlau M. Boosted regression (boosting): An introductory tutorial and a Stata plugin . Stata J . 2005 ; 5 : 330 ± 354 . doi: The Stata Journal
79. Li Q , Racine JS . Cross-Validation Local Linear Nonparametric Regression . Stat Sin . 2004 ; 14 : 485 ± 512 .
80. Hurvich CM , Simonoff JS , Tsai C . Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion . J R Stat Soc Ser B . 1998 ; 60 : 271 ± 293 .
81. Cleveland WS , Devlin SJ . Locally Weighted Regression: An Approach to Regression Analysis by Local Fifing . J Am Stat Assoc . 1988 ; 83 : 596 ± 610 .
82. Yang H. The Case for Being Automatic: Introducing the Automatic Linear Modeling (LINEAR) Procedure in SPSS Statistics . Mult Linear Regres Viewpoints . 2013 ; 39 : 27 ± 37 .
83. IBM. IBM SPSS Advanced Statistics 22 . 2011 .
85. FernaÂndez-Delgado M , Cernadas E , Barro S , Amorim D , Amorim FernaÂndez-Delgado D . Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? J Mach Learn Res . 2014 ; 15 : 3133 ± 3181 . https://doi.org/10.1016/j.csda. 2008 . 10 .033
86. Tulyakov S , Jaeger S , Govindaraju V , Doermann D . Review of Classifier Combination Methods . Rev Classif Comb Methods . 2007 ; 90 : 361 ± 386 . https://doi.org/10.1007/978-3- 540 -76280-5
87. Schapire RE . The Boosting Approach to Machine Learning: An Overview . Nonlinear Estimation and Classification . 2003 . pp. 149 ± 171 . https://doi.org/10.1007/978-0- 387 -21579- 2 _ 9
88. Cook RD , Weisberg S. Residuals and Influence in Regression. 1982 . https://doi.org/10.2307/1269506
89. BuÈhlmann P. Boosting for high-dimensional linear models . Ann Stat . 2006 ; 34 : 559 ± 583 . https://doi.org/ 10.1214/009053606000000092
90. Bosnjak M , Tuten TL . Classifying Response Behaviors in Web-based Surveys . J Comput Commun . 2001 ; 6 : 14 .
91. Galesic M , Bosnjak M. Effects of Questionnaire Length on Participation and Indicators of Response Quality in a Web Survey . Public Opin Q . 2009 ; 73 : 349 ± 360 . https://doi.org/10.1093/poq/nfp031
92. Mahmud J. IBM Watson Personality Insights: The science behind the service [Internet] . Almaden, USA: IBM; 2015 . Available: https://developer.ibm.com/watson/blog/2015/03/23/ibm-watson -personalityinsights-science-behind-service/
93. Pennebaker J , Mehl MR , Niederhoffer KG . Psychological aspects of natural language use: our words, our selves . Annu Rev Psychol . 2003 ; 54 : 547 ± 77 . https://doi.org/10.1146/annurev.psych. 54 .101601. 145041 PMID: 12185209
94. Deerwester S , Dumais ST , Furnas GW , Landauer TK . Indexing by Latent Semantic Analysis . J Am Soc Inf Sci . 1998 ; 41 : 391 ± 407 .