# Data Science and Engineering

## List of Papers (Total 66)

#### Private Blocking Technique for Multi-party Privacy-Preserving Record Linkage

The process of matching and integrating records that relate to the same entity from one or more datasets is known as record linkage, and it has become an increasingly important subject in many application areas, including business, government and health systems. The data from these areas often contain sensitive information. To prevent privacy breaches, ideally records should be...

#### Guiding the Training of Distributed Text Representation with Supervised Weighting Scheme for Sentiment Analysis

With the rapid growth of social media, sentiment analysis has received growing attention from both academic and industrial fields. One line of research for sentiment analysis is to feed bag-of-words (BOW) text representation into classifiers. Usually, raw BOW requires weighting schemes to obtain better performance, where important words are given more weight while unimportant...

#### Top-k Team Recommendation and Its Variants in Spatial Crowdsourcing

With the rapid development of the mobile internet and the online-to-offline marketing model, various spatial crowdsourcing platforms, such as Gigwalk and Gmission, are becoming popular. Most existing studies assume that spatial crowdsourced tasks are simple and trivial. However, many real crowdsourced tasks are complex and need to be collaboratively finished by a team of crowd workers...

#### Model-Based Diversification for Sequential Exploratory Queries

Today, data exploration platforms are widely used to assist users in locating interesting objects within large volumes of scientific and business data. In those platforms, users try to make sense of the underlying data space by iteratively posing numerous queries over large databases. While diversification of query results, like other data summarization techniques, provides users...

Log replication is a key component in highly available database systems. In order to guarantee data consistency and reliability, it is common for modern database systems to utilize the Paxos protocol, which is responsible for replicating transactional logs from one primary node to multiple backups. However, Paxos replication needs to store and synchronize some additional metadata...

#### Context-Aware Recommendations with Random Partition Factorization Machines

Context plays an important role in helping users to make decisions. In real scenarios, there is a hierarchical structure between contexts, and there are aggregation characteristics within a context. Existing works mainly focus on exploring the explicit hierarchy between contexts, while ignoring the aggregation characteristics within the context. In this work, we explore both of them so as to...

#### Investigating TSP Heuristics for Location-Based Services

Travel planning is one of the important issues in location-based services (LBS). The traveling salesman problem (TSP) is to find the optimal tour that visits each point exactly once with the minimum total distance. Given the hardness of TSP (NP-hard), the TSP query for a given set of points, $Q$, is not widely studied for online LBS, and the nearest-neighbor heuristic is the only...
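
For readers unfamiliar with the nearest-neighbor heuristic named in this abstract, a minimal sketch follows. It is a generic textbook version, not the paper's implementation: the function names `nearest_neighbor_tour` and `tour_length` are hypothetical, points are assumed to be coordinate tuples, and Euclidean distance is assumed.

```python
from math import dist


def nearest_neighbor_tour(points, start=0):
    """Greedy tour: from the current point, always move to the closest unvisited point.

    Returns the visiting order as a list of indices into `points`.
    Assumes Euclidean distance; ties are broken by the smaller index.
    """
    if not points:
        return []
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: (dist(last, points[i]), i))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def tour_length(points, tour):
    """Total length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


if __name__ == "__main__":
    pts = [(0, 0), (3, 0), (1, 0), (1, 1)]
    order = nearest_neighbor_tour(pts)
    print(order, tour_length(pts, order))
```

The heuristic runs in quadratic time in the number of points and gives no optimality guarantee; it only yields a quick feasible tour.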

#### Graph Partitioning for Distributed Graph Processing

There is a large demand for distributed engines that efficiently process large-scale graph data, such as social graphs and web graphs. Distributed graph engines execute the analysis after partitioning the input graph data and assigning the partitions to distributed computers, so the quality of graph partitioning largely affects the communication cost and load balance among computers during...

#### Graph-Based RDF Data Management

The increasing size of RDF data requires efficient systems to store and query them. There have been efforts to map RDF data to a relational representation, and a number of systems exist that follow this approach. We have been investigating an alternative approach of maintaining the native graph model to represent RDF data, and utilizing graph database techniques (such as a...

#### Distance-Aware Selective Online Query Processing Over Large Distributed Graphs

Performing online selective queries against graphs is a challenging problem due to the unbounded nature of graph queries, which leads to poor computation locality. It becomes even more difficult when a graph is too large to fit in memory. Although there have been emerging efforts on managing large graphs in a distributed and parallel setting, e.g., Pregel and HaLoop, these...

#### Efficient Breadth-First Search on Massively Parallel and Distributed-Memory Machines

There are many large-scale graphs in the real world, such as Web graphs and social graphs. Interest in large-scale graph analysis has been growing in recent years. Breadth-First Search (BFS) is one of the most fundamental graph algorithms and is used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS for a one-trillion-vertex graph within half...

#### Big Graph Analyses: From Queries to Dependencies and Association Rules

This position paper provides an overview of our recent advances in the study of big graphs, from theory to systems to applications. We introduce a theory of bounded evaluability, to query big graphs by accessing a bounded amount of the data. Based on this, we propose a framework to query big graphs with constrained resources. Beyond queries, we propose functional dependencies for...

#### Local Weighted Matrix Factorization for Top-n Recommendation with Implicit Feedback

Item recommendation helps people to discover items of potential interest among large numbers of items. One of the most common applications is to recommend top-n items on implicit feedback datasets (e.g., listening history, watching history or visiting history). In this paper, we assume that the implicit feedback matrix has a local property, where the original matrix is not globally...

#### Efficient Maximal Clique Enumeration Over Graph Data

In a wide variety of emerging data-intensive applications, such as social network analysis, Web document clustering, entity resolution, and detection of consistently co-expressed genes in systems biology, the detection of dense subgraphs (cliques) is an essential component. Unfortunately, this problem is NP-Complete and thus computationally intensive at scale—hence there is a...

#### Pre-computed Region Guardian Sets Based Reverse kNN Queries

Given a set of objects and a query q, a point p is q’s Reverse k Nearest Neighbour (RkNN) if q is one of p’s k-closest objects. RkNN queries have received significant research attention in the past few years. However, we realize that the state-of-the-art algorithm, SLICE, accesses many objects that do not contribute to its RkNN results when running the filtering phase, which...
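
The RkNN definition in this abstract can be made concrete with a brute-force sketch. This is only an illustration of the definition, not SLICE or the paper's method: the function name `rknn` is hypothetical, objects are assumed to be coordinate tuples, distance is assumed Euclidean, and an object tied with q in distance is assumed not to displace q.

```python
from math import dist


def rknn(objects, q, k):
    """Return every object p such that q is one of p's k closest objects.

    Brute force: for each p, count the other objects that are strictly
    closer to p than q is. If fewer than k are closer, q ranks among
    p's k nearest, so p is a reverse k nearest neighbour of q.
    """
    result = []
    for i, p in enumerate(objects):
        d_pq = dist(p, q)
        closer = sum(
            1 for j, o in enumerate(objects) if j != i and dist(p, o) < d_pq
        )
        if closer < k:
            result.append(p)
    return result


if __name__ == "__main__":
    objs = [(0, 0), (1, 0), (5, 0), (6, 0)]
    print(rknn(objs, (2, 0), 1))
```

Each query costs quadratic time in the number of objects, so the sketch is only useful as a correctness baseline for small inputs.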

#### An I/O-Efficient Buffer Batch Replacement Policy for Update-Intensive Graph Databases

With the proliferation of graph-based applications, such as social network management and Web structure mining, update-intensive graph databases have become an important component of today’s data management platforms. Several techniques have been recently proposed to exploit locality on both data organization and computational model in graph databases. However, little...

#### Homomorphic Pattern Mining from a Single Large Data Tree

Finding interesting tree patterns hidden in large datasets is a central topic in data mining with many practical applications. Unfortunately, previous contributions have focused almost exclusively on mining induced patterns from a set of small trees. The problem of mining homomorphic patterns from a large data tree has been neglected. This is mainly due to the challenging...

#### Big Data Reduction Methods: A Survey

Research on big data analytics is entering a new phase called fast data, where multiple gigabytes of data arrive in big data systems every second. Modern big data systems collect inherently complex data streams due to the volume, velocity, value, variety, variability, and veracity in the acquired data and consequently give rise to the 6Vs of big data. The reduced and...

#### A Practical Privacy-Preserving Recommender System

The main goal of a personalized recommender system is to provide useful recommendations on various items to the users. In order to generate recommendations, the service needs to access various types of user data such as previous product purchasing history, demographic and biographical information. However, users are sensitive to disclosure of personal information as it can be...

#### Time for Addressing Software Security Issues: Prediction Models and Impacting Factors

Finding and fixing software vulnerabilities have become a major struggle for most software development companies. While generally without alternative, such fixing efforts are a major cost factor, which is why companies have a vital interest in focusing their secure software development activities such that they obtain an optimal return on this investment. We investigate, in this...

#### Provenance for Wireless Sensor Networks: A Survey

In wireless sensor networks (WSNs), provenance records the data source, forwarding, and aggregating information of data packets on their way to the base station. Provenance is critical for assessing the trustworthiness of the received data, diagnosing network failures, detecting early signs of attacks, etc. However, because the provenance size expands rapidly with the increase in...

#### Efficient and Secure Storage for Outsourced Data: A Survey

With the growing popularity of cloud computing, more and more enterprises and individuals tend to store their sensitive data on the cloud in order to reduce the cost of data management. However, new security and privacy challenges arise when data are stored in the cloud, due to the loss of data control by the data owner. This paper focuses on the techniques of verifiable data...