# International Journal of Data Science and Analytics

## List of Papers (Total 86)

#### Scoring Bayesian networks of mixed variables

In this paper we outline two novel scoring methods for learning Bayesian networks in the presence of both continuous and discrete variables, that is, mixed variables. While much work has been done in the domain of automated Bayesian network learning, few studies have investigated this task in the presence of both continuous and discrete variables while focusing on scalability. Our ...

#### Scalable Twitter user clustering approach boosted by Personalized PageRank

Twitter has been the focus of analysis in regard to various interesting and challenging problems, one of them being clustering of its users based on their interests. There are many clustering approaches for graphs that look at either the structure or the contents of the graph. However, when we consider real-world complex data such as Twitter data, structural approaches may produce ...

#### Deep learning for detecting inappropriate content in text

Today, there are a large number of online discussion fora on the internet which are meant for users to express, discuss and exchange their views and opinions on various topics. For example, news portals, blogs, and social media channels such as YouTube typically allow users to express their views through comments. In such fora, it has often been observed that user conversations ...

#### BFSPMiner: an effective and efficient batch-free algorithm for mining sequential patterns over data streams

Supporting sequential pattern mining from data streams is nowadays a relevant problem in the area of data stream mining research. Current proposals available in the literature are based on the well-known PrefixSpan approach and are, indeed, able to effectively bound the error of discovered patterns. This approach relies on the idea of dividing the target stream into a collection of ...

#### Analyzing evolving stories in news articles

There is an overwhelming number of news articles published every day around the globe. Following the evolution of a news story is a difficult task, given that no mechanism is available to trace back in time to discover and study the hidden relationships between relevant events in digital news feeds. The techniques developed so far to extract meaningful information from a ...

#### NDlib: a python library to model and analyze diffusion processes over complex networks

Nowadays the analysis of dynamics of and on networks represents a hot topic in the social network analysis playground. To support students, teachers, developers and researchers, in this work we introduce a novel framework, namely NDlib, an environment designed to describe diffusion simulations. NDlib is designed to be a multi-level ecosystem that can be fruitfully used by different ...

#### Feature selection for spatially enhanced LBP: application to face recognition

Block-based local binary patterns, a.k.a. enhanced local binary patterns (ELBPs), have proven to be a highly discriminative descriptor for face recognition and image retrieval. Since this descriptor is mainly composed of histograms, little work (if any) has been done on selecting its relevant features (either the bins or the blocks). In this paper, we address feature selection for ...

#### Using semantic graphs to detect overlapping target events and story lines from newspaper articles

Event detection from text data is an active area of research. While the emphasis in the literature has been on event identification and labeling using a single data source, this work considers event and story line detection when using a large number of data sources. In this setting, it is natural for different events in the same domain, e.g., violence, sports, politics, to occur at ...

#### Preventing the diffusion of information to vulnerable users while preserving PageRank

Limiting the diffusion of information in social networks is important in viral marketing and computer security. To achieve this, existing works aim to prevent the diffusion of information to as many nodes as possible, by deleting a given number of edges. Thus, they adopt a collective approach and quantify the impact of deletion on the graph, based on the number of deleted edges. In ...

#### Declarative data analysis

The relational database model constituted a major breakthrough in database technology. It provided a conceptual model for data storage and retrieval that made querying databases much easier. A crucial aspect of this was the introduction of declarative query languages, such as SQL. Today, databases are not only used for retrieving data, but also for analyzing them. While the science ...

#### Discovering co-location patterns with aggregated spatial transactions and dependency rules

Co-location pattern mining focuses on finding associations among spatial features. Existing co-location pattern mining techniques mainly rely on frequency-based thresholds, which discard rare patterns and find noisy ones. This could be avoided by evaluating co-location patterns based on their statistical significance. Recent studies focused on association rule mining ...

#### Online conformance checking: relating event streams to process models using prefix-alignments

Companies often specify the intended behaviour of their business processes in a process model. Conformance checking techniques allow us to assess to what degree such process models and corresponding process execution data correspond to one another. In recent years, alignments have proven extremely useful for calculating conformance checking statistics. Existing techniques to ...

#### Latent sentiment topic modelling and nonparametric discovery of online mental health-related communities

Social media are an online means of interaction among individuals. People are increasingly using social media, especially online communities, to discuss health concerns and seek support. Understanding topics, sentiment, and structures of these communities informs important aspects of health-related conditions. There has been growing research interest in analysing online mental ...

#### A data mining framework for environmental and geo-spatial data analysis

Mining geo-spatial data is an important task in many application domains, such as environmental science, geographic information science, and social networks. In this paper, we introduce a data mining framework, which includes pre-processing of environmental and geo-spatial data, geo-spatial data mining techniques, and visual analysis of environmental and geo-spatial data. In ...

#### Visual analytics of high-frequency lake monitoring data

In recognizing the cumulative effects of multiple stressors on altering aquatic ecosystem function, scientists have become increasingly interested in capturing high-frequency response variables using a variety of sensors. This practice has led to a demand for novel ways to visualize and analyze the wealth of data in order to meet policy and management goals. Time series data ...

#### SPIN: cleaning, monitoring, and querying image streams generated by ground-based telescopes for space situational awareness

With the increasing number of objects in Earth orbit, space situational awareness (SSA) becomes critical to space safety. As an economical option, ground-based telescopes can be deployed around the world and continuously provide imagery of space objects. However, they also raise unique challenges regarding big, noisy, and streaming data processing. In this paper, we ...

#### HierFlat: flattened hierarchies for improving top-down hierarchical classification

Large-scale classification of structured data where classes are organized in a hierarchical structure is an important area of research. Top-down approaches that leverage the hierarchy during the learning and prediction phase are efficient for solving large-scale hierarchical classification. However, accuracy of top-down approaches is poor due to error propagation, i.e., prediction ...

#### Hiding outliers in high-dimensional data spaces

Detecting outliers in very high-dimensional data is crucial in many domains. Due to the curse of dimensionality, one typically does not detect outliers in the full space, but in subspaces of it. More specifically, since the number of subspaces is huge, the detection takes place in only some subspaces. In consequence, one might miss hidden outliers, i.e. outliers only detectable in ...

#### A spectral clustering approach for multivariate geostatistical data

Spectral clustering has recently become one of the most popular modern clustering methods for conventional data. However, applied to geostatistical data, the general spectral clustering method produces clusters that are spatially non-contiguous, which is undesirable for many geoscience applications. In this paper, a spectral clustering approach is proposed, allowing to ...

#### Efficient identification of Tanimoto nearest neighbors

Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the ...
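
For readers unfamiliar with the measure, the following is a minimal sketch of the standard Tanimoto (extended Jaccard) similarity between two real-valued vectors, together with a brute-force nearest-neighbor search. The function names `tanimoto` and `nearest_neighbors` are hypothetical and chosen here for illustration. The exhaustive scan is only a naive baseline and is not the method proposed in the paper.

```python
def tanimoto(a, b):
    """Tanimoto (extended Jaccard) similarity: a.b / (|a|^2 + |b|^2 - a.b)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denom = norm_a + norm_b - dot
    # The denominator is zero only when both vectors are all zeros;
    # returning 0.0 in that case is a convention assumed for this sketch.
    return dot / denom if denom else 0.0


def nearest_neighbors(query, items, k=1):
    """Naive baseline: score every item against the query and keep the top k."""
    ranked = sorted(items, key=lambda v: tanimoto(query, v), reverse=True)
    return ranked[:k]
```

For binary vectors the formula reduces to the ordinary Jaccard coefficient (size of the intersection divided by size of the union), which is why the measure is also called extended Jaccard.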

#### Scalable and flexible clustering solutions for mobile phone-based population indicators

Mobile phones have an unprecedented rate of penetration across the world. Such devices produce a large amount of data that have been used in different domains. In this work, we make use of mobile calls to monitor the presence of individuals region by region. Traditionally, this activity has been conducted by means of censuses and surveys. Nowadays, technologies open new ...

#### Tell cause from effect: models and evaluation

Causal relationships differ from statistical relationships, and distinguishing cause from effect is a fundamental scientific problem that has attracted the interest of many researchers. Among causal discovery problems, discovering bivariate causal relationships is a special case. Causal relationships between two variables (“X causes Y” or “Y causes X”) belong to the same Markov ...

#### Using data to build a better EM: EM* for big data

Existing data mining techniques, more particularly iterative learning algorithms, become overwhelmed with big data. While parallelism is an obvious and, usually, necessary strategy, we observe that both (1) continually revisiting data and (2) visiting all data are two of the most prominent problems especially for iterative, unsupervised algorithms like expectation maximization ...