International Journal of Data Science and Analytics

http://link.springer.com/journal/41060

List of Papers (Total 105)

Stable Bayesian optimization

Tuning hyperparameters of machine learning models is important for their performance. Bayesian optimization has recently emerged as a de facto method for this task. Hyperparameter tuning is usually performed by looking at model performance on a validation set. Bayesian optimization is used to find the hyperparameter set corresponding to the best model performance. However, in...

Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization

Credit card fraud detection is a very challenging problem because of the specific nature of transaction data and the labeling process. The transaction data are peculiar because they are obtained in a streaming fashion, and they are strongly imbalanced and prone to non-stationarity. The labeling is the outcome of an active learning process, as every day human investigators contact...

Elliptical modeling and pattern analysis for perturbation models and classification

The characteristics of a feature vector in the transform domain of a perturbation model differ significantly from those of its corresponding feature vector in the input domain. These differences—caused by the perturbation techniques used for the transformation of feature patterns—degrade the performance of machine learning techniques in the transform domain. In this paper, we...

Spatial-aware hyperspectral image classification via multifeature kernel dictionary learning

Sparse representation based on dictionary learning has yielded impressive effects on hyperspectral image (HSI) classification. But most of these methods utilize only the single spectral feature of HSI and advanced features are not considered, such that the discriminability of sparse representation coefficients is relatively weak. In this paper, we propose a novel multifeature...

Data Science: a proposal for a curriculum

We define Data Science as the combination of statistical and computational data analytic approaches. We argue that only this combination allows us to tackle the many problems occurring in today’s Big Data era. We outline a possible curriculum, which focuses on both statistics and computer science aspects of data analytics. The proposed curriculum is implemented in the Data Science...

Prospective crowdsensing versus retrospective ratings of tinnitus variability and tinnitus–stress associations based on the TrackYourTinnitus mobile platform

Many symptoms of neuropsychiatric disorders, such as tinnitus, are subjective and vary over time. Usually, in interviews or self-report questionnaires, patients are asked to retrospectively report symptoms as well as their severity, duration and influencing factors. However, little is known about the degree to which such retrospective reports reflect the actual experiences made in...

Large-scale asynchronous distributed learning based on parameter exchanges

In many distributed learning problems, the heterogeneous loading of computing machines may harm the overall performance of synchronous strategies, as each machine begins its new computations after receiving aggregated information from a master, and any delay in sending local information to the latter may be a bottleneck. In this paper, we propose an effective asynchronous...

Three controversies in health data science

The routine operation of modern healthcare systems produces a wealth of data in electronic health records, administrative databases, clinical registries, and other clinical systems. It is widely acknowledged that there is great potential for utilising these routine data for health research to derive new knowledge about health, disease, and treatments. However, the reuse of...

Data science as a language: challenges for computer science—a position paper

In this paper, I posit that from a research point of view, Data Science is a language. More precisely, Data Science is doing Science using computer science as a language for datafied sciences, much as mathematics is the language of, e.g., physics. From this viewpoint, three (classes of) challenges for computer science are identified, complementing the challenges the closely...

Data Science: the impact of statistics

In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview of different proposed structures of Data Science and address the impact of statistics on such steps...

Comparison of strategies for scalable causal discovery of latent variable models from mixed data

Modern technologies allow large, complex biomedical datasets to be collected from patient cohorts. These datasets are comprised of both continuous and categorical data (“Mixed Data”), and essential variables may be unobserved in this data due to the complex nature of biomedical phenomena. Causal inference algorithms can identify important relationships from biomedical data...

BJR-tree: fast skyline computation algorithm using dominance relation-based tree structure

High-throughput label-free single-cell screening technology has been studied for the noninvasive analysis of various kinds of cells. Selecting the prominent cells with extreme features from a large number of cells is an important and interesting problem, which we call the serendipitous searching problem (SSP). In the SSP, it is important to find entries located near the rind of...

Personalized market response analysis for a wide variety of products from sparse transaction data

Advanced database marketing is designed to ascertain individual customers’ market responses with a discount or display of widely various products from transaction data. However, transaction data recorded in a supermarket or electronic commerce are fundamentally sparse because most customers purchase only a few products from all products in shops. Existing methods are not applicable...

Parallel edge-based visual assessment of cluster tendency on GPU

The visual assessment of (cluster) tendency (VAT) algorithm is an effective tool for investigating cluster tendency, which produces an intuitive image of matrix as the representation of complex datasets. The improved VAT (iVAT) incorporates a path-based distance metric into VAT to improve its effectiveness on complex-shaped datasets. The efficient formulation of the iVAT...

Big data and precision medicine: challenges and strategies with healthcare data

Recent snapshots of the European progress on big data in health care and precision medicine reveal diverse perceptions of experts and the public, leading to the impression that algorithmic issues have the largest share among the challenges all health systems are faced with. Yet, from a comparison of different countries it is evident that the adaption and integration of...

Fast causal inference with non-random missingness by test-wise deletion

Many real datasets contain values missing not at random (MNAR). In this scenario, investigators often perform list-wise deletion, or delete samples with any missing values, before applying causal discovery algorithms. List-wise deletion is a sound and general strategy when paired with algorithms such as FCI and RFCI, but the deletion procedure also eliminates otherwise good...

Scoring Bayesian networks of mixed variables

In this paper we outline two novel scoring methods for learning Bayesian networks in the presence of both continuous and discrete variables, that is, mixed variables. While much work has been done in the domain of automated Bayesian network learning, few studies have investigated this task in the presence of both continuous and discrete variables while focusing on scalability...

Sports analytics and the big-data era

The explosion of data, with large datasets that are available for analysis, has affected virtually every aspect of our lives. The sports industry has not been immune to these developments. In this article, we provide examples of three types of data-driven analyses that have been performed in the domain of sport: (a) field-level analysis focused on the behavior of athletes...

Maritime pattern extraction and route reconstruction from incomplete AIS data

Effective barge scheduling in the logistic domain requires advanced information on the availability of the port terminals and the maritime traffic in their vicinity. To enable a long-term prediction of vessel arrival times, we investigate how to use the publicly available automatic identification system (AIS) data to identify maritime patterns and transform them into a directed...

Scalable Twitter user clustering approach boosted by Personalized PageRank

Twitter has been the focus of analysis in regard to various interesting and challenging problems, one of them being clustering of its users based on their interests. There are many clustering approaches for graphs that look at either the structure or the contents of the graph. However, when we consider real-world complex data such as Twitter data, structural approaches may...

Deep learning for detecting inappropriate content in text

Today, there are a large number of online discussion fora on the internet which are meant for users to express, discuss and exchange their views and opinions on various topics. For example, news portals, blogs, and social media channels such as YouTube typically allow users to express their views through comments. In such fora, it has often been observed that user conversations...

BFSPMiner: an effective and efficient batch-free algorithm for mining sequential patterns over data streams

Supporting sequential pattern mining from data streams is nowadays a relevant problem in the area of data stream mining research. Current proposals available in the literature are based on the well-known PrefixSpan approach and are, indeed, able to effectively bound the error of discovered patterns. This approach foresees the idea of dividing the target stream into a collection of...

Big Text advantages and challenges: classification perspective

Big Text, i.e., large repositories of textual data, is a part of Big Data. In total, 80–85% of Big Text comes in unstructured form, with significant contribution from social media. In this position paper, we discuss Big Text advantages and challenges with respect to text classification. We propose a new approach to performance evaluation of classification algorithms when they...