Data Science and Engineering

http://link.springer.com/journal/41019

List of Papers (Total 59)

Correction to: Guiding the Training of Distributed Text Representation with Supervised Weighting Scheme for Sentiment Analysis

In the originally published article, the acknowledgment section is missing. Please find it as follows.

Correction to: Query Optimal k-Plex Based Community in Graphs

In the initial publication, the first name and family name of the second author, Xun Jian, were switched. The original article has been corrected.

Data Privacy Protection Mechanisms in Cloud

In the cloud computing environment, the privacy of the electronic data is a serious issue that requires special considerations. We have presented a state-of-the-art review of the methodologies and approaches that are currently being used to cope with the significant issue of privacy. We have categorized the privacy-preserving approaches into four categories, i.e., privacy by ...

Keyphrase Extraction Using Knowledge Graphs

Extracting keyphrases from documents automatically is an important and interesting task since keyphrases provide a quick summary of documents. Although considerable effort has been devoted to keyphrase extraction, most of the existing methods (the co-occurrence-based methods and the statistic-based methods) do not take semantics into full consideration. The co-occurrence-based ...

Robust Cross-Platform Workflows: How Technical and Scientific Communities Collaborate to Develop, Test and Share Best Practices for Data Analysis

Information integration and workflow technologies for data analysis have always been major fields of investigation in bioinformatics. A range of popular workflow suites are available to support analyses in computational biology. Commercial providers tend to offer prepared applications remote to their clients. However, for most academic environments with local expertise, novel data ...

Reordering Transaction Execution to Boost High-Frequency Trading Applications

High-frequency trading (HFT) has always been welcomed because it improves not only individual gains but also overall social welfare. While the recent advance of portfolio selection in the HFT market makes higher profits possible, it yields highly contended OLTP workloads. Designed to exploit abundant parallelism, transaction pipeline, the state-of-the-art concurrency control ...

Sliding Window Top-K Monitoring over Distributed Data Streams

Most of the traditional top-k algorithms are based on a single-server setting. They may be highly inefficient and/or cause huge communication overhead when applied to a distributed system environment. Therefore, the problem of top-k monitoring in distributed environments has been intensively investigated recently. This paper studies how to monitor the top-k data objects with the ...
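As a point of reference for the single-server setting mentioned above, the following is a minimal sketch of sliding-window top-k on one node. It is not the distributed method studied in the paper; the class name `SlidingWindowTopK` and its parameters are hypothetical and chosen only for illustration.

```python
import heapq
from collections import deque


class SlidingWindowTopK:
    """Single-server baseline: keep the last `window_size` items, report the k largest."""

    def __init__(self, window_size, k):
        # deque with maxlen drops the oldest item automatically when full
        self.window = deque(maxlen=window_size)
        self.k = k

    def push(self, obj_id, score):
        # Store (score, id) so that tuples compare by score first
        self.window.append((score, obj_id))

    def top_k(self):
        # Recompute over the whole window; simple, but O(w log k) per query
        return [obj_id for _, obj_id in heapq.nlargest(self.k, self.window)]


monitor = SlidingWindowTopK(window_size=3, k=2)
for obj_id, score in [("a", 5), ("b", 1), ("c", 9), ("d", 3)]:
    monitor.push(obj_id, score)
result = monitor.top_k()  # "a" has already slid out of the window
```

Recomputing the answer over the full window on every query is what becomes costly once the data is spread across many nodes, which is the inefficiency the abstract refers to.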

Query Optimal k-Plex Based Community in Graphs

The community search problem, which is to find good communities given a set of query nodes in a graph, has attracted increasing research interest recently. Although various measurement models have been proposed to define and solve the community search problem, few of them define a community concisely and yield high-quality query results. They either involve additional constraints for ...

A Feedback-Based Approach to Utilizing Embeddings for Clinical Decision Support

Clinical Decision Support (CDS) is widely seen as an information retrieval (IR) application in the medical domain. The goal of CDS is to help physicians find useful information from a collection of medical articles with respect to the given patient records, in order to take the best care of their patients. Most of the existing CDS methods do not sufficiently consider the semantic ...

Tracking Time Evolving Data Streams for Short-Term Traffic Forecasting

Data streams have arisen as a relevant topic during the last few years as an efficient method for extracting knowledge from big data. In the robust layered ensemble model (RLEM) proposed in this paper for short-term traffic flow forecasting, incoming traffic flow data of all connected road links are organized in chunks corresponding to an optimal time lag. The RLEM model is ...

A Review of Scalable Bioinformatics Pipelines

Scalability is increasingly important for bioinformatics analysis services, since these must handle larger datasets, more jobs, and more users. The pipelines used to implement analyses must therefore scale with respect to the resources on a single compute node, the number of nodes on a cluster, and also to cost-performance. Here, we survey several scalable bioinformatics pipelines ...

Trust-based Modelling of Multi-criteria Crowdsourced Data

As a recommendation technique based on historical user information, collaborative filtering typically predicts the classification of items using a single criterion for a given user. However, many application domains can benefit from the analysis of multiple criteria, e.g. tourists usually rate venues (hotels, attractions, restaurants, etc.) using multiple criteria. In this ...

Using GUHA Data Mining Method in Analyzing Road Traffic Accidents Occurred in the Years 2004–2008 in Finland

This report studies the suitability of the GUHA data mining method for analyzing a big data matrix in general and, in particular, examines a data matrix containing more than 80,000 road traffic accidents that occurred in Finland in 2004–2008 using LISp-Miner, a software implementation of GUHA. The general principles of GUHA are first outlined, and then the road accident data is ...

Big Data Management: What to Keep from the Past to Face Future Challenges?

The emergence of new hardware architectures and the continuous production of data open new challenges for data management. It is no longer pertinent to reason with respect to a predefined set of resources (i.e., computing, storage and main memory). Instead, it is necessary to design data processing algorithms and processes considering unlimited resources via the “pay-as-you-go” ...

Optimal Compressed Sensing and Reconstruction of Unstructured Mesh Datasets

Exascale computing promises quantities of data too large to efficiently store and transfer across networks in order to be able to analyze and visualize the results. We investigate compressed sensing (CS) as an in situ method to reduce the size of the data as it is being generated during a large-scale simulation. CS works by sampling the data on the computational cluster within an ...

Private Blocking Technique for Multi-party Privacy-Preserving Record Linkage

The process of matching and integrating records that relate to the same entity from one or more datasets is known as record linkage, and it has become an increasingly important subject in many application areas, including business, government and health systems. The data from these areas often contain sensitive information. To prevent privacy breaches, ideally records should be ...

Guiding the Training of Distributed Text Representation with Supervised Weighting Scheme for Sentiment Analysis

With the rapid growth of social media, sentiment analysis has received growing attention from both academic and industrial fields. One line of research for sentiment analysis is to feed bag-of-words (BOW) text representation into classifiers. Usually, raw BOW requires weighting schemes to obtain better performance, where important words are given more weight while unimportant ...
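To illustrate what weighting a raw BOW representation means, the sketch below applies plain TF-IDF, a common unsupervised scheme. This is an assumption made for illustration only: it is not the supervised weighting scheme the paper proposes, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each term by term frequency times log inverse document frequency.

    `docs` is a list of token lists; returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weighted


docs = [["good", "movie"], ["bad", "movie"]]
weights = tfidf(docs)
# "movie" occurs in every document, so its weight is 0;
# "good" and "bad" each receive weight log(2).
```

A term that appears everywhere carries no discriminative signal and is weighted down, while terms specific to a few documents are weighted up.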

Low-Overhead Paxos Replication

Log replication is a key component in highly available database systems. In order to guarantee data consistency and reliability, it is common for modern database systems to use the Paxos protocol, which is responsible for replicating transactional logs from one primary node to multiple backups. However, Paxos replication needs to store and synchronize some additional metadata, ...

Top-k Team Recommendation and Its Variants in Spatial Crowdsourcing

With the rapid development of the mobile internet and the online-to-offline marketing model, various spatial crowdsourcing platforms, such as Gigwalk and Gmission, are becoming popular. Most existing studies assume that spatial crowdsourced tasks are simple and trivial. However, many real crowdsourced tasks are complex and need to be collaboratively finished by a team of crowd workers with ...

Model-Based Diversification for Sequential Exploratory Queries

Today, data exploration platforms are widely used to assist users in locating interesting objects within large volumes of scientific and business data. In those platforms, users try to make sense of the underlying data space by iteratively posing numerous queries over large databases. While diversification of query results, like other data summarization techniques, provides users ...

Context-Aware Recommendations with Random Partition Factorization Machines

Context plays an important role in helping users to make decisions. In real scenarios, there is a hierarchical structure between contexts, and there are aggregation characteristics within a context. Existing work mainly focuses on exploring the explicit hierarchy between contexts, while ignoring the aggregation characteristics within the context. In this work, we explore both of them so as to ...