# Knowledge and Information Systems

## List of Papers (Total 33)

#### Discovering topic structures of a temporally evolving document corpus

In this paper we describe a novel framework for the discovery of the topical content of a data corpus, and the tracking of its complex structural changes across the temporal dimension. In contrast to previous work our model does not impose a prior on the rate at which documents are added to the corpus nor does it adopt the Markovian assumption which overly restricts the type of ...

#### Event stream-based process discovery using abstract representations

The aim of process discovery, originating from the area of process mining, is to discover a process model based on business process execution data. A majority of process discovery techniques relies on an event log as an input. An event log is a static source of historical data capturing the execution of a business process. In this paper, we focus on process discovery relying on ...

#### $L_p$ -Support vector machines for uplift modeling

Uplift modeling is a branch of machine learning which aims to predict not the class itself, but the difference between the class variable behavior in two groups: treatment and control. Objects in the treatment group have been subjected to some action, while objects in the control group have not. By including the control group, it is possible to build a model which predicts the ...

#### Prequential AUC: properties of the area under the ROC curve for data streams with concept drift

Modern data-driven systems often require classifiers capable of dealing with streaming imbalanced data and concept changes. The assessment of learning algorithms in such scenarios is still a challenge, as existing online evaluation measures focus on efficiency, but are susceptible to class ratio changes over time. In case of static data, the area under the receiver operating ...

#### Random indexing of multidimensional data

Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a sparse implementation of random ...

#### Context-dependent combination of sensor information in Dempster–Shafer theory for BDI

There has been much interest in the belief–desire–intention (BDI) agent-based model for developing scalable intelligent systems, e.g. using the AgentSpeak framework. However, reasoning from sensor information in these large-scale systems remains a significant challenge. For example, agents may be faced with information from heterogeneous sources which is uncertain and incomplete, ...

#### Partial materialization for online analytical processing over multi-tagged document collections

The New York Times Annotated Corpus, the ACM Digital Library, and PubMed are three prototypical examples of document collections in which each document is tagged with keywords or phrases. Such collections can be viewed as high-dimensional document cubes against which browsers and search systems can be applied in a manner similar to online analytical processing against data cubes. ...

#### Clustering XML documents by patterns

Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML’s semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering ...

#### Model-based probabilistic frequent itemset mining

Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World ...

#### Decision rules extraction from data stream in the presence of changing context for diabetes treatment

The knowledge extraction is an important element of the e-Health system. In this paper, we introduce a new method for decision rules extraction called Graph-based Rules Inducer to support the medical interview in the diabetes treatment. The emphasis is put on the capability of hidden context change tracking. The context is understood as a set of all factors affecting patient ...

#### Decision trees for uplift modeling with single and multiple treatments

Most classification approaches aim at achieving high prediction accuracy on a given dataset. However, in most practical cases, some action such as mailing an offer or treating a patient is to be taken on the classified objects, and we should model not the class probabilities themselves, but instead, the change in class probabilities caused by the action. The action should then be ...

#### Correcting evaluation bias of relational classifiers with network cross validation

Recently, a number of modeling techniques have been developed for data mining and machine learning in relational and network domains where the instances are not independent and identically distributed (i.i.d.). These methods specifically exploit the statistical dependencies among instances in order to improve classification accuracy. However, there has been little focus on how ...

#### Data preprocessing techniques for classification without discrimination

Recently, the following Discrimination-Aware Classification Problem was introduced: Suppose we are given training data that exhibit unlawful discrimination; e.g., toward sensitive attributes such as gender or ethnicity. The task is to learn a classifier that optimizes accuracy, but does not have this discrimination in its predictions on test data. This problem is relevant in many ...

#### A hybrid decision tree training method using data streams

Classical classification methods usually assume that pattern recognition models do not depend on the timing of the data. However, this assumption is not valid in cases where new data frequently become available. Such situations are common in practice, for example, spam filtering or fraud detection, where dependencies between feature values and class numbers are continually ...

#### Cluster-based instance selection for machine classification

Instance selection in the supervised machine learning, often referred to as the data reduction, aims at deciding which instances from the training set should be retained for further use during the learning process. Instance selection can result in increased capabilities and generalization properties of the learning model, shorter time of the learning process, or it can help in ...

#### Hybrid rules with well-founded semantics

A general framework is proposed for integration of rules and external first-order theories. It is based on the well-founded semantics of normal logic programs and inspired by ideas of Constraint Logic Programming (CLP) and constructive negation for logic programs. Hybrid rules are normal clauses extended with constraints in the bodies; constraints are certain formulae in the ...

#### Product selection for promotion planning

This paper addresses a very important question—how to select the right products to promote in order to maximize promotional benefit. We set up a framework to incorporate promotion decisions into the data-mining process, formulate the profit maximization problem as an optimization problem, and propose a heuristic search solution to discover the right products to promote. Moreover, ...

#### Semi-automated schema integration with SASMINT

The emergence of increasing number of collaborating organizations has made clear the need for supporting interoperability infrastructures, enabling sharing and exchange of data among organizations. Schema matching and schema integration are the crucial components of the interoperability infrastructures, and their semi-automation to interrelate or integrate heterogeneous and ...