Data Science and Engineering

http://link.springer.com/journal/41019

List of Papers (Total 45)

Trust-based Modelling of Multi-criteria Crowdsourced Data

As a recommendation technique based on historical user information, collaborative filtering typically predicts the classification of items using a single criterion for a given user. However, many application domains can benefit from the analysis of multiple criteria, e.g., tourists usually rate places (hotels, attractions, restaurants, etc.) using multiple criteria. In this ...

Using GUHA Data Mining Method in Analyzing Road Traffic Accidents Occurred in the Years 2004–2008 in Finland

This report studies the suitability of the GUHA data mining method for analyzing a big data matrix in general and, in particular, examines a data matrix containing more than 80,000 road traffic accidents that occurred in Finland in 2004–2008 using LISp-Miner, a software implementation of GUHA. The general principles of GUHA are first outlined, and then the road accident data is ...

Big Data Management: What to Keep from the Past to Face Future Challenges?

The emergence of new hardware architectures and the continuous production of data open new challenges for data management. It is no longer pertinent to reason with respect to a predefined set of resources (i.e., computing, storage and main memory). Instead, it is necessary to design data processing algorithms and processes considering unlimited resources via the “pay-as-you-go” ...

Optimal Compressed Sensing and Reconstruction of Unstructured Mesh Datasets

Exascale computing promises quantities of data too large to efficiently store and transfer across networks in order to be able to analyze and visualize the results. We investigate compressed sensing (CS) as an in situ method to reduce the size of the data as it is being generated during a large-scale simulation. CS works by sampling the data on the computational cluster within an ...

Private Blocking Technique for Multi-party Privacy-Preserving Record Linkage

The process of matching and integrating records that relate to the same entity from one or more datasets is known as record linkage, and it has become an increasingly important subject in many application areas, including business, government and health systems. The data from these areas often contain sensitive information. To prevent privacy breaches, ideally records should be ...

Guiding the Training of Distributed Text Representation with Supervised Weighting Scheme for Sentiment Analysis

With the rapid growth of social media, sentiment analysis has received growing attention from both academic and industrial fields. One line of research for sentiment analysis is to feed bag-of-words (BOW) text representation into classifiers. Usually, raw BOW requires weighting schemes to obtain better performance, where important words are given more weight while unimportant ...

Low-Overhead Paxos Replication

Log replication is a key component in highly available database systems. In order to guarantee data consistency and reliability, it is common for modern database systems to utilize the Paxos protocol, which is responsible for replicating transactional logs from one primary node to multiple backups. However, Paxos replication needs to store and synchronize some additional metadata, ...

Top-k Team Recommendation and Its Variants in Spatial Crowdsourcing

With the rapid development of the mobile internet and the online-to-offline marketing model, various spatial crowdsourcing platforms, such as Gigwalk and Gmission, are becoming popular. Most existing studies assume that spatial crowdsourced tasks are simple and trivial. However, many real crowdsourced tasks are complex and need to be collaboratively finished by a team of crowd workers with ...

Model-Based Diversification for Sequential Exploratory Queries

Today, data exploration platforms are widely used to assist users in locating interesting objects within large volumes of scientific and business data. In those platforms, users try to make sense of the underlying data space by iteratively posing numerous queries over large databases. While diversification of query results, like other data summarization techniques, provides users ...

Context-Aware Recommendations with Random Partition Factorization Machines

Context plays an important role in helping users to make decisions. In real scenarios, there is a hierarchical structure between contexts and there are aggregation characteristics within a context. Existing works mainly focus on exploring the explicit hierarchy between contexts, while ignoring the aggregation characteristics within the context. In this work, we explore both of them so as to ...

Investigating TSP Heuristics for Location-Based Services

Travel planning is one of the important issues in location-based services (LBS). The traveling salesman problem (TSP) is to find the optimal tour that visits each point exactly once with the minimum total distance. Given the hardness of TSP (NP-hard), the TSP query for a given set of points, \(Q\), is not widely studied for online LBS, and the nearest-neighbor heuristic is the only ...
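For readers unfamiliar with the nearest-neighbor heuristic named in this abstract, the following is a minimal textbook sketch in Python, not the paper's implementation. The function names `nearest_neighbor_tour` and `tour_length` are hypothetical, points are assumed to be 2-D tuples with Euclidean distance, and the tour is assumed to close by returning to the start.

```python
from math import dist


def nearest_neighbor_tour(points, start=0):
    """Greedy tour: from the current point, always move to the closest
    unvisited point. Returns the visiting order as a list of indices."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        last = points[tour[-1]]
        # Ties on distance are broken by the smaller index (an assumption).
        nxt = min(unvisited, key=lambda i: (dist(last, points[i]), i))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def tour_length(points, tour):
    """Total length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


if __name__ == "__main__":
    pts = [(0, 0), (3, 0), (1, 0), (6, 0)]
    order = nearest_neighbor_tour(pts)
    print(order, tour_length(pts, order))
```

The heuristic runs in quadratic time in the number of points and gives no optimality guarantee, which is why it is usually treated as a fast baseline rather than a solver.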

Graph Partitioning for Distributed Graph Processing

There is a large demand for distributed engines that efficiently process large-scale graph data, such as social graphs and web graphs. Distributed graph engines execute the analysis after partitioning the input graph data and assigning the partitions to distributed computers, so the quality of graph partitioning largely affects the communication cost and load balance among computers during ...

Graph-Based RDF Data Management

The increasing size of RDF data requires efficient systems to store and query them. There have been efforts to map RDF data to a relational representation, and a number of systems exist that follow this approach. We have been investigating an alternative approach of maintaining the native graph model to represent RDF data, and utilizing graph database techniques (such as a ...

Distance-Aware Selective Online Query Processing Over Large Distributed Graphs

Performing online selective queries against graphs is a challenging problem due to the unbounded nature of graph queries, which leads to poor computation locality. It becomes even more difficult when a graph is too large to fit in memory. Although there have been emerging efforts on managing large graphs in a distributed and parallel setting, e.g., Pregel, HaLoop, etc., these ...

Efficient Breadth-First Search on Massively Parallel and Distributed-Memory Machines

There are many large-scale graphs in the real world, such as Web graphs and social graphs. Interest in large-scale graph analysis has been growing in recent years. Breadth-First Search (BFS) is one of the most fundamental graph algorithms and is used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS for a graph with one trillion vertices within half a ...

Big Graph Analyses: From Queries to Dependencies and Association Rules

This position paper provides an overview of our recent advances in the study of big graphs, from theory to systems to applications. We introduce a theory of bounded evaluability, to query big graphs by accessing a bounded amount of the data. Based on this, we propose a framework to query big graphs with constrained resources. Beyond queries, we propose functional dependencies for ...

Local Weighted Matrix Factorization for Top-n Recommendation with Implicit Feedback

Item recommendation helps people discover items of potential interest among large numbers of items. One of the most common applications is to recommend top-n items on implicit feedback datasets (e.g., listening history, watching history or visiting history). In this paper, we assume that the implicit feedback matrix has a local property, where the original matrix is not globally ...

Efficient Maximal Clique Enumeration Over Graph Data

In a wide variety of emerging data-intensive applications, such as social network analysis, Web document clustering, entity resolution, and detection of consistently co-expressed genes in systems biology, the detection of dense subgraphs (cliques) is an essential component. Unfortunately, this problem is NP-Complete and thus computationally intensive at scale—hence there is a need ...

Pre-computed Region Guardian Sets Based Reverse kNN Queries

Given a set of objects and a query q, a point p is q’s Reverse k Nearest Neighbour (RkNN) if q is one of p’s k-closest objects. RkNN queries have received significant research attention in the past few years. However, we realize that the state-of-the-art algorithm, SLICE, accesses many objects that do not contribute to its RkNN results when running the filtering phase, which ...
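The RkNN definition in this abstract can be illustrated with a brute-force sketch in Python. This is only a reading of the definition, not SLICE or the paper's method. The names `k_nearest` and `reverse_knn` are hypothetical, points are assumed to be 2-D tuples with Euclidean distance, and each point's candidate neighbours are assumed to be all other points plus the query.

```python
from math import dist


def k_nearest(p, candidates, k):
    """Return the k candidates closest to p (ties broken by list order)."""
    return sorted(candidates, key=lambda c: dist(p, c))[:k]


def reverse_knn(q, points, k):
    """Brute force: p belongs to RkNN(q) if q is among p's k closest objects.
    points must be a list; every point is checked, so the cost is quadratic."""
    result = []
    for i, p in enumerate(points):
        # Candidates for p: every other point, plus the query itself.
        others = points[:i] + points[i + 1:] + [q]
        if q in k_nearest(p, others, k):
            result.append(p)
    return result


if __name__ == "__main__":
    pts = [(0, 0), (4, 0), (10, 0)]
    print(reverse_knn((3, 0), pts, k=1))
    print(reverse_knn((3, 0), pts, k=2))
```

Because this version examines every object for every point, it shows why practical algorithms add a filtering phase to prune objects that cannot affect the result.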

An I/O-Efficient Buffer Batch Replacement Policy for Update-Intensive Graph Databases

With the proliferation of graph-based applications, such as social network management and Web structure mining, update-intensive graph databases have become an important component of today’s data management platforms. Several techniques have been recently proposed to exploit locality on both data organization and computational model in graph databases. However, little investigation ...

Homomorphic Pattern Mining from a Single Large Data Tree

Finding interesting tree patterns hidden in large datasets is a central topic in data mining with many practical applications. Unfortunately, previous contributions have focused almost exclusively on mining induced patterns from a set of small trees. The problem of mining homomorphic patterns from a large data tree has been neglected. This is mainly due to the challenging unbounded ...

Big Data Reduction Methods: A Survey

Research on big data analytics is entering a new phase called fast data, in which multiple gigabytes of data arrive in big data systems every second. Modern big data systems collect inherently complex data streams due to the volume, velocity, value, variety, variability, and veracity in the acquired data and consequently give rise to the 6Vs of big data. The reduced and ...

Time for Addressing Software Security Issues: Prediction Models and Impacting Factors

Finding and fixing software vulnerabilities has become a major struggle for most software development companies. While generally without alternative, such fixing efforts are a major cost factor, which is why companies have a vital interest in focusing their secure software development activities such that they obtain an optimal return on this investment. We investigate, in this ...