Prediction of employment and unemployment rates from Twitter daily rhythms in the US
Bokányi et al. EPJ Data Science
Eszter Bokányi
Zoltán Lábszki
Gábor Vattay
By modeling macroeconomic indicators using digital traces of human activity on mobile or social networks, we can gain important insights into processes that were previously assessed only via paper-based surveys or polls. We collected aggregated workday activity timelines of US counties from the normalized number of messages sent in each hour on the online social network Twitter. In this paper, we show how county employment and unemployment statistics are encoded in the daily rhythm of people by decomposing the activity timelines into a linear combination of two dominant patterns. The mixing ratio of these patterns defines a measure for each county that correlates significantly with employment (0.46 ± 0.02) and unemployment rates (-0.34 ± 0.02). Thus, the two dominant activity patterns can be linked to rhythms signaling the presence or lack of regular working hours of individuals. The analysis could give policy makers better insight into the processes governing employment, where problems could be identified not only from the number of officially registered unemployed, but also from the digital footprints people leave on different platforms.
unemployment prediction; Twitter; social media; activity patterns
1 Introduction
Until recently, collecting and analyzing data about individual humans at a large scale
has been time-consuming, costly and arduous work. With the advent of the digital era, there
is a growing amount of data accessible online that enables the analysis and modeling of
human behavior. However, our understanding of these digital data sources and of the methods
that connect the data to real-world outcomes is still limited.
Several aspects of the possible use of mobile phone records and social media status
updates in the estimation of official data, such as census, demographic or land use records,
have been discussed in recent papers. A promising approach is the analysis of the diurnal
rhythm of humans. Due to the 24-hour periodicity of the Earth’s rotation, we are
biologically bound to show daily periodic behavior both at the individual and at the aggregate
level. This periodic cycle is governed mainly by internal biochemical processes [–], but
the impact of external factors and the environment also leaves its imprint on these daily
patterns [, ].
As Saramäki and Moro point out in their paper [], an interesting application is to
consider the geospatial aspects of the aggregate level of daily rhythms, as it can provide insight
into several different phenomena ranging from the actual land use patterns in a city [–]
and on a campus [], to the tracking of anomalous events [, ], or the estimation of
population size [], mobility patterns [], poverty [] or crime rates [] in a certain
area.
Because these aggregate patterns always consist of the superposition of the daily
rhythms of individuals, it is worth investigating how the main features of the aggregate
level form from superposition. If we can cluster individuals into more or less
homogeneously behaving groups based on their daily patterns [], then the aggregate pattern
can be understood as the combination of the group patterns, and the group that has more
individuals dominates the aggregate daily rhythm. The groups of individuals can form
along many demographic and/or socioeconomic factors, of which being employed and
going to and from work at regular hours is the most decisive one with respect to the
daily activity patterns. Thus, separating the group contributions in the aggregate patterns of
different geographical regions may make it possible to estimate employment statistics
in those regions.
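The superposition argument can be illustrated with a minimal sketch. It is not the procedure used in this paper: the two group patterns below are invented Gaussian-shaped curves, the names `pattern_a`, `pattern_b`, `true_share` and `mixing_ratio` are hypothetical, and the group patterns are assumed to be known in advance, which is not the case for real data. Under these assumptions, an aggregate 24-hour pattern is built as a weighted sum of two normalized group patterns, and the share of the first group is recovered by least squares.

```python
import numpy as np

HOURS = 24  # one bin per hour of the day


def normalize(timeline):
    """Scale an hourly activity vector so that its entries sum to 1."""
    timeline = np.asarray(timeline, dtype=float)
    return timeline / timeline.sum()


def mixing_ratio(aggregate, pattern_a, pattern_b):
    """Least-squares weight w in aggregate ~ w * pattern_a + (1 - w) * pattern_b."""
    diff = pattern_a - pattern_b
    w = np.dot(aggregate - pattern_b, diff) / np.dot(diff, diff)
    # A share of the population must lie between 0 and 1.
    return float(np.clip(w, 0.0, 1.0))


hours = np.arange(HOURS)

# Hypothetical group patterns (illustrative shapes only, not data from the paper):
# group A is active mostly in the evening, group B mostly around midday.
pattern_a = normalize(np.exp(-0.5 * ((hours - 20) / 2.0) ** 2) + 0.05)
pattern_b = normalize(np.exp(-0.5 * ((hours - 14) / 4.0) ** 2) + 0.05)

# Assumed share of group A in a fictitious region.
true_share = 0.7
aggregate = true_share * pattern_a + (1 - true_share) * pattern_b

estimated = mixing_ratio(aggregate, pattern_a, pattern_b)
print(round(estimated, 3))
```

In this noise-free toy case the estimated share matches the assumed one. With real data the group patterns are unknown and have to be inferred from the aggregate timelines themselves, and the estimate is affected by noise.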
Nowcasting or estimating unemployment rates using the digital traces of search engines
has already been the focus of several papers [–]. It has already been shown that the
daily activity patterns of individuals can be linked to the regularity of their working hours
[]. Because the loss of a job has severe psychological consequences [], the effects of a
mass layoff can be detected in the unemployment rates and provide a possibility of
forecasting macroeconomic effects based on the observation of several individuals []. In [],
there is strong evidence that aggregated daily activities in certain time intervals of
geographical regions can be indicative of unemployment rates.
In this paper we obtain million geolocated messages from the publicly available
stream of the social network Twitter from the area of the United States sent between
January and October . We aggregate Monday to Friday relative tweeting activity for each
hour in each US county to form an average workday activity pattern. We then assume that
these activity patterns form a roughly linear subspace of the 24-hour “timespace”. By
finding this linear subspace, that is, by find (...truncated)