EPJ Data Science

EPJ Data Science is a peer-reviewed open access journal published under the SpringerOpen brand. Data-driven science is rapidly emerging as a complementary ...

List of Papers (Total 384)

Scaling hermeneutics: a guide to qualitative coding with LLMs for reflexive content analysis

Qualitative coding, or content analysis, is more than just labeling text: it is a reflexive interpretive practice that shapes research questions, refines theoretical insights, and illuminates subtle social dynamics. As large language models (LLMs) become increasingly adept at nuanced language tasks, questions arise about whether—and how—they can assist in large-scale coding...

High earnings through firm influence: the role of hierarchical structures in public procurement

Public procurement, a critical but often overlooked aspect of governance, plays a pivotal role in steering the acquisition of goods, services and the commissioning of public works. Our study, analyzing over one million public procurement contracts from the Portuguese public administration, applies network science to unravel the complexities of this market. We uncover a market...

Unsupervised detection of coordinated information operations in the wild

This paper introduces and tests an unsupervised method for detecting novel coordinated inauthentic information operations (CIOs) in realistic settings. This method uses Bayesian inference to identify groups of accounts that share similar account-level characteristics and target similar narratives. We solve the inferential problem using amortized variational inference, allowing us...

Socioeconomic disparities in mobility behavior during the COVID-19 pandemic in developing countries

Mobile phone data have played a key role in quantifying human mobility during the COVID-19 pandemic. Existing studies on mobility patterns have primarily focused on regional aggregates in high-income countries, obfuscating the accentuated impact of the pandemic on the most vulnerable populations. Leveraging geolocation data from mobile-phone users and population census for 6...

The impact of playlist characteristics on coherence in user-curated music playlists

Music playlist creation is a crucial, yet not fully explored task in music data mining and music information retrieval. Previous studies have largely focused on investigating diversity, popularity, and serendipity of tracks in human- or machine-generated playlists. However, the concept of playlist coherence – vaguely defined as smooth transitions between tracks – remains poorly...

Lost in translation: using global fact-checks to measure multilingual misinformation prevalence, spread, and evolution

Misinformation and disinformation are growing threats in the digital age, affecting people across languages and borders. However, no research has investigated the prevalence of multilingual misinformation and quantified the extent to which misinformation diffuses across languages. This paper investigates the prevalence and dynamics of multilingual misinformation through an...

Mapping global value chains at the product level

Value chain data is crucial for navigating economic disruptions. Yet, despite its importance, we lack publicly available product-level value chain datasets, since resources such as the “World Input-Output Database”, “Inter-Country Input-Output Tables”, “EXIOBASE”, and “EORA” lack information about products (e.g. Radio Receivers, Telephones, Electrical Capacitors, LCDs, etc.) and...

Using semantic similarity to measure the echo of strategic communications

Many actors use strategic communications to impact media debates through targeted messages and campaigns, but the scale and diversity of online media content make it difficult to evaluate the impact of a particular message or campaign. In this paper, we present a new technique that leverages semantic similarity of actor messages and media content to quantify the change in media...

Assessing geographic polarisation in Britain’s digital landscape through stable dynamic embedding of spatial web data

This paper employs Unfolded Adjacency Spectral Embedding (UASE) to investigate the temporal evolution of economic relationships between locations in Great Britain. We utilise timestamped, geolocated website hyperlinks data between archived, commercial websites in Britain, which are aggregated to create a set of directed, weighted networks of hyperlink connections between Local...

Unmasking social bots: how confident are we?

Social bots remain a major vector for spreading disinformation on social media and a menace to the public. Despite the progress made in developing multiple sophisticated social bot detection algorithms and tools, bot detection remains a challenging, unsolved problem that is fraught with uncertainty due to the heterogeneity of bot behaviors, training data, and detection algorithms...

Entropy-based text feature engineering approach for forecasting financial liquidity changes

Changes in individual and institutional financial behavior leading to shifts in liquidity flows often depend on events reflected in news. However, the task of establishing a relationship between financial behavior and news remains challenging and understudied. We propose a news-based feature generation approach that allows accounting for news events in liquidity flow time-series...

Demographic disparity in Wikipedia coverage: a global perspective

Despite decades-long efforts to increase diversity, underrepresented social groups remain small minorities in many fields. Here, we ask whether disparities in global recognition exist for traditionally underrepresented demographic groups. We investigate whether a notable person’s demographic attributes are associated with their global recognition, considering both the global...

Weakly supervised veracity classification with LLM-predicted credibility signals

Credibility signals represent a wide range of heuristics typically used by journalists and fact-checkers to assess the veracity of online content. Automating the extraction of credibility signals presents significant challenges due to the necessity of training high-accuracy, signal-specific extractors, coupled with the lack of sufficiently large annotated datasets. This paper...

Addressing long-tailed distribution in judicial text for criminal motive classification: a balanced contrastive learning approach

Understanding criminal motives is crucial for analyzing criminal psychology and predicting judicial outcomes. Traditional methods for crime motive analysis are heavily based on statistical techniques, requiring specialized knowledge and substantial human resources. With the increasing availability of judicial data, such as legal documents, machine learning approaches hold great...

Understanding stock market instability via graph auto-encoders

Understanding stock market instability is a key question in financial management as practitioners seek to forecast breakdowns in long-run asset co-movement patterns which expose portfolios to rapid and devastating collapses in value. These disruptions are linked to changes in the structure of market wide stock correlations which increase the risk of high volatility shocks. The...

Longitudinal modularity, a modularity for link streams

Temporal networks are commonly used to model real-life phenomena. When these phenomena represent interactions and are captured at a fine-grained temporal resolution, they are modeled as link streams. Community detection is an essential network analysis task. Although many methods exist for static networks, and some methods have been developed for temporal networks represented as...

The microvelocity of money in Ethereum

The transfer velocity of money is a macroeconomic quantity that measures the frequency of exchanges in an economy. For cryptoassets, it can be measured exactly by adopting a new approach, MicroVelocity. In this study we apply the framework to Ether, the native cryptocurrency of the Ethereum blockchain, to investigate velocity and its top contributors and how they can be characterised...

Social media warfare: investigating human-bot engagement in English, Japanese and German during the Russo-Ukrainian war on Twitter and Reddit

The Russo-Ukrainian War represents a significant contemporary conflict between two global powers, yet the dynamics of human-bot engagement during this conflict, particularly on social media platforms like Twitter and Reddit, remain underexplored. Existing literature has not adequately addressed how bots and humans interact differently across languages within this geopolitical...

Uncovering and estimating complementarity in urban lives

We typically think of the demand volume for a business in a city as a function of basic characteristics, such as the type of business, the quality of the product or service offered and its pricing. In addition, factors related to the urban environment, such as population density and accessibility, are also crucial and have been considered in the literature. However, these...

Connectivity and community structure of online and register-based social networks

The dominance of online social media data as a source for large-scale social network studies has recently been challenged by networks constructed from state-curated register data. In this paper focused on the cross-comparison of the network structures, we investigate the similarities and differences of the Dutch online social network (OSN) Hyves and a register-based social...

Resilience-oriented passenger subsidy design for taxi travel under pandemic control

Summarizing historical pandemic control experience can help the government better cope with the impact of uncertain public health events on the taxi industry. This paper presents a summary of the relationship between various pandemic control measures and the taxi system from the perspective of travel resilience. Additionally, we investigate the effectiveness of passenger subsidy schemes...

Correlation and autocorrelation of data on complex networks

Networks where each node has one or more associated numerical values are common in applications. This work studies how summary statistics used for the analysis of spatial data can be applied to non-spatial networks for the purposes of exploratory data analysis. We focus primarily on Moran-type statistics and discuss measures of global autocorrelation, local autocorrelation and...

Evolution of sample-based music authorship network

Sample-based music—characterized by the adoption of extant audio fragments (sampling) in its creation process—plays an essential role in contemporary popular music, fostering inter-generational connections between the creators that have resulted in a rich and diverse sonic landscape. The selection, manipulation, and adoption of samples heavily impact the genre, mood, texture, and...

A new approach to estimate neighborhood socioeconomic status using supermarket transactions and GNNs

Ending poverty in all its forms everywhere remains the number one Sustainable Development Goal of the United Nations 2030 Agenda. Governments face challenges in measuring socioeconomic status with fine spatial resolution because traditional data collection methods, such as censuses and surveys, are time-consuming, labor-intensive, performed at long intervals, and cover only a...