On the properties of von Neumann kernels for link analysis

Machine Learning, Apr 2009

We study the effectiveness of Kandola et al.'s von Neumann kernels as a link analysis measure. We show that von Neumann kernels subsume Kleinberg's HITS importance at the limit of their parameter range. Because they reduce to co-citation relatedness at the other end of the parameter range, von Neumann kernels give us a spectrum of link analysis measures between the two established measures of importance and relatedness. Hence the relative merit of a vertex can be evaluated in terms of varying trade-offs between global importance and local relatedness within a single parametric framework. As a generalization of HITS, von Neumann kernels inherit the problem of topic drift. When a graph consists of multiple communities each representing a different topic, HITS is known to rank vertices in the most dominant community higher regardless of the query term. This problem persists in von Neumann kernels; when the parameter is biased towards the direction of global importance, they tend to rank vertices in the dominant community uniformly higher irrespective of the community of the seed vertex relative to which the ranking is computed. To alleviate topic drift, we propose the use of a PLSI-based technique in combination with von Neumann kernels. Experimental results on a citation network of scientific papers demonstrate the characteristics and effectiveness of von Neumann kernels.



Masashi Shimbo · Takahiko Ito · Daichi Mochihashi · Yuji Matsumoto

Editors: Thomas Gärtner and Gemma C. Garriga

Present address of T. Ito: FAST, a Microsoft Subsidiary, Daido-Seimei Kasumigaseki, 1-4-2 Kasumigaseki, Chiyoda-ku, Tokyo 100-0013, Japan. This work was carried out while T. Ito was a Ph.D. student at Nara Institute of Science and Technology.

1 Introduction

PageRank (Brin and Page 1998) and HITS (Hypertext-Induced Topic Search) (Kleinberg 1999) are two of the most popular methods for evaluating the importance of web pages. They are used extensively in search engines to compute the ranking of web pages from the network structure induced by hyperlinks. Even in the early days of bibliometric studies, links (or citations) between documents were already recognized as a major source of information for analyzing the relationships between scientific papers, authors, and journals. A line of research in bibliometrics has dealt with quantifying the relatedness of two given documents. Co-citation coupling (Small 1973) is one such classical measure of relatedness still widely used, for example, by the popular scientific literature database CiteSeer (Bollacker et al. 1998) to recommend related papers to the user.

Discussion of these two lines of link analysis measures, importance and relatedness, has remained somewhat independent in the past. The first objective of this paper is to present a unified framework that accounts for both importance and relatedness, and to further define measures intermediate between the two. Our approach is based on Kandola et al.'s von Neumann kernel (Kandola et al. 2003). We show that this kernel nicely bridges the gap between importance and relatedness. Specifically, it subsumes not only the co-citation and bibliographic coupling (Kessler 1963) relatedness at one end of the parameter range, but also the HITS importance at the other extreme. Between these established relatedness and importance measures lies a spectrum of intermediate link analysis measures, all of which can be obtained by tuning a single parameter of the kernel.
von Neumann kernels thus provide an attractive framework for link analysis, but there are some difficulties in their practical application. In Sect. 4 of this paper, we discuss one of these difficulties, known as topic drift (Bharat and Henzinger 1998). This problem is noticeable when a graph consists of multiple communities each addressing a different topic; if von Neumann kernels are applied to such a graph with the parameter biased towards importance, they assign the highest scores to the vertices in the dominant community irrespective of the seed vertex, the vertex relative to which the ranking is computed.

We propose a method for avoiding topic drift in von Neumann kernels. To this end, we model the generative process of citations, borrowing the idea from Cohn and Chang's PHITS (Cohn and Chang 2000), and construct distinct graphs for individual communities. All these community graphs have the same vertex set as the original graph, but their edges are reweighted according to the generative probability of the corresponding citation in the respective community. Applying von Neumann kernels to the community graphs allows us to take communities into consideration even when the kernels are biased towards importance.

In Sect. 5, we discuss some kernels and ranking methods related to von Neumann kernels. We also discuss the connection between the community-based von Neumann kernels we propose and Hofmann's PLSI-based Fisher kernels (Hofmann 2000).

This paper is an integrated and extended version of the work that appeared in (Ito et al. 2005, 2006; Shimbo and Ito 2006), with a focus on von Neumann kernels. The recommendation system simulation in Sect. 6.3 is substantially updated from the preliminary evaluation presented in (Shimbo et al. 2007): the evaluation criterion is modified to be more realistic, and more kernel-based ranking methods are compared.

2 Importance and relatedness measures in link analysis

Most of the existing link analysis measures can be classified into one of two types:¹ relatedness or importance. Relatedness measures quantify the similarity of two vertices in a graph, or the relevance of one vertex to another. Importance, on the other hand, is a measure for ranking a given group of vertices in the order of their significance, impact, or popularity within the group. In this section, we review several link analysis methods of relatedness and importance that are relevant to subsequent discussions. For more comprehensive surveys on link analysis, see (Baldi et al. 2003, Chap. 5) and (Dhyani et al. 2002).

We assume that the vertices of a graph are indexed by natural numbers, and identify a vertex with its index. We also use v_i to denote the ith vertex. Throughout this paper, matrices are denoted by uppercase letters, and vectors are denoted by boldface letters. The symbol T denotes matrix transposition. All matrices are square and all vectors are column vectors unless noted otherwise. For integers i and j, [A]_ij represents the (i, j)-element of a matrix A, and [v]_i represents the ith component of vector v. We denote by 1 the vector of all 1s, and by e_i the unit vector with 1 at the ith component and 0 at the remaining components. I is an identity matrix, and for any matrix A, ρ(A) denotes its spectral radius, i.e., the largest modulus of the eigenvalues of A. We consider both non-weighted and weighted graphs, but in all cases, edge weights are positive. For non-weighted graphs, all edge weights are assumed to be 1.
The adjacency matrix A of a graph is a matrix whose element [A]_ij holds the weight of the edge from vertex i to vertex j if such an edge exists, and 0 otherwise. A graph is undirected if [A]_ij = [A]_ji for every pair of vertices (i, j); otherwise it is directed. Thus, the adjacency matrix is symmetric for undirected graphs. For a path in a graph, its weight is defined as the product of the weights of all edges composing the path, and its length is the (cumulative) number of these edges. The distance between vertices is the minimum length of paths between them.

2.1 Relatedness: co-citation and bibliographic coupling

Link analysis assumes that in a graph modeling a network structure such as bibliographic citations or the web, an edge (e.g., a citation or a hyperlink) between a pair of vertices (papers or web pages) signifies that these vertices are in some sense related. Hence the degree of relatedness can be inferred from the vertex proximity induced by the existence of edges. It is not assumed, however, that all related vertices are directly connected by edges, because edges can be omitted for various reasons. The challenge of link analysis is to capture the relationship between vertices even if they are not directly connected.

Co-citation (Small 1973) and bibliographic coupling (Kessler 1963) are the standard methods of computing relatedness between documents in a citation network, or, more generally, between vertices in a graph. Co-citation coupling defines the relatedness between two documents as the number of other documents citing them both, and bibliographic coupling defines it as the number of common references cited by the two. These measures can be calculated from the adjacency matrix of a citation graph. Given an adjacency matrix A, the number of co-citations between vertices i and j is given by the (i, j)-element of the co-citation matrix A^T A. Similarly, the bibliographic coupling matrix AA^T gives the values of bibliographic coupling. Because these matrices are symmetric, their graph counterparts, the co-citation graph and the bibliographic coupling graph, are undirected, even if the original citation graph is not.

¹ One notable exception to this classification is White and Smyth's relative importance (White and Smyth 2003), which we will discuss in Sect. 5.
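To make the matrix formulation of Sect. 2.1 concrete, the following is a minimal numpy sketch of both coupling matrices; the toy adjacency matrix is invented for illustration, not taken from this paper.

```python
# A toy illustration of co-citation and bibliographic coupling (Sect. 2.1).
import numpy as np

# [A]_ij = 1 iff document i cites document j (a small made-up citation graph).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

cocitation = A.T @ A      # [A^T A]_ij: number of documents citing both i and j
bibliographic = A @ A.T   # [A A^T]_ij: number of references shared by i and j

# Both matrices are symmetric, so the induced graphs are undirected.
print(cocitation)
print(bibliographic)
```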
2.2 Importance: Kleinberg's HITS

Because of the difficulty of computing the importance of documents from their contents, citation counts (or the in-degree of vertices) have long been used as an index of document importance. Support for this approach was provided by several researchers; even though citations are made for various reasons, a positive correlation was observed between the number of citations and the significance or impact of the cited work. See le Pair (1988) for a list of literature on this topic. The recently proposed SALSA algorithm (Lempel and Moran 2001) can be viewed as providing a justification for using the vertex in-degree as an importance index, from a random walk point of view.

Kleinberg's HITS (Hypertext-Induced Topic Search) (Kleinberg 1999) is a more recent and sophisticated method for evaluating document importance. HITS assigns two scores to each document (vertex), called the authority and hub scores. The assumption behind HITS is the presence of mutual reinforcement between authorities and hubs: authoritative documents are the ones that are cited by many hub documents, and hub documents are the ones that cite many authorities.

Let A be the adjacency matrix of a graph. The HITS algorithm computes the following recursion over n = 0, 1, …, starting from a^(0) = h^(0) = 1:

$$\mathbf{a}^{(n+1)} = \frac{A^{\mathrm T}\mathbf{h}^{(n)}}{\lVert A^{\mathrm T}\mathbf{h}^{(n)}\rVert}, \qquad \mathbf{h}^{(n+1)} = \frac{A\,\mathbf{a}^{(n)}}{\lVert A\,\mathbf{a}^{(n)}\rVert}.$$

The ith component of the authority vector lim_{n→∞} a^(n) represents the authority score of vertex i. Similarly, the hub vector lim_{n→∞} h^(n) gives the hub scores. It is well known that the above recursion reduces to the power method for computing dominant eigenvectors. If A^T A and² AA^T have a unique dominant eigenvalue, the authority and hub vectors exist and are equal to the dominant eigenvectors of A^T A and AA^T, respectively.

² A^T A and AA^T have the same set of eigenvalues.

3 von Neumann kernels as a unified link analysis measure

In this section, we present a formulation of link analysis measures that are intermediate between the importance of vertices and their relatedness.

The concept of an intermediate between importance and relatedness might sound ill-defined, since importance is a measure defined on individual vertices, whereas relatedness is defined between them. However, given an importance score vector v such as the HITS authority vector, vv^T defines a matrix in which every row (and column) i gives a ranking of vertices identical to the one given by v, provided that³ [v]_i ≠ 0. Importance can thus be treated as a function over a pair of vertices, or a matrix, as well.

³ The assumption of [v]_i ≠ 0 is weaker than it appears. For example, HITS authority vectors satisfy [v]_i ≠ 0 for every vertex i whenever the co-citation graph is connected.

The basis of our formulation is the von Neumann kernel due to Kandola et al. (2003). This kernel is defined over graph vertices, and is symmetric positive semidefinite. In other words, it provides a Gram matrix holding inner products of vertices of a graph.

3.1 von Neumann kernels

The von Neumann kernels (Kandola et al. 2003) were first proposed as a method for computing document similarity from terms occurring in documents, in a manner akin to latent semantic analysis (Deerwester et al. 1990). The von Neumann kernels are defined in terms of the term-by-document rectangular matrix X whose (i, j)-element is the (possibly reweighted) frequency of the ith term occurring in document j. We first need the document correlation matrix K = X^T X and the term correlation matrix G = XX^T. The von Neumann kernel matrices are then defined in terms of K and G as follows.

Definition 1 Let X be a term-by-document matrix, and let K = X^T X and G = XX^T. The von Neumann kernel matrices with diffusion factor γ (≥ 0), denoted by K_γ and G_γ, are defined as the solution to the following system of equations:

$$K_\gamma = \gamma\, X^{\mathrm T} G_\gamma X + K, \qquad\qquad (1)$$
$$G_\gamma = \gamma\, X K_\gamma X^{\mathrm T} + G. \qquad\qquad (2)$$

Shawe-Taylor and Cristianini (2004) presented an equivalent Neumann series representation of these kernels:

$$K_\gamma = \sum_{n=0}^{\infty} \gamma^n K^{n+1}, \qquad\qquad (3)$$
$$G_\gamma = \sum_{n=0}^{\infty} \gamma^n G^{n+1}. \qquad\qquad (4)$$

3.2 Link analysis with von Neumann kernels

Using the adjacency matrix A of a citation graph in place of the term-by-document matrix X, we obtain the von Neumann kernels based on citation information. We thus have K = A^T A and G = AA^T, which coincide with the co-citation and bibliographic coupling matrices, respectively. For convenience, let us introduce the shorthand

$$N_\gamma(B) = \sum_{n=0}^{\infty} \gamma^n B^{n+1}. \qquad\qquad (5)$$

The von Neumann kernels based on the adjacency matrix A can then be written as

$$K_\gamma = N_\gamma(A^{\mathrm T} A) = \sum_{n=0}^{\infty} \gamma^n (A^{\mathrm T} A)^{n+1}, \qquad\qquad (6)$$
$$G_\gamma = N_\gamma(A A^{\mathrm T}) = \sum_{n=0}^{\infty} \gamma^n (A A^{\mathrm T})^{n+1}. \qquad\qquad (7)$$

As seen from (6) and (7), G_γ can be obtained simply by transposing A in (6). We therefore focus on K_γ = N_γ(A^T A) below.
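As a sketch (not the authors' own code), the HITS iteration and the kernel (6) can be computed as follows; the closed form K_γ = K(I − γK)^{−1} of the Neumann series is valid for γ < 1/ρ(K).

```python
# A sketch of the HITS recursion (Sect. 2.2) and the von Neumann kernel (6).
import numpy as np

def hits(A, n_iter=100):
    """Authority and hub vectors by the power method (normalized each step)."""
    a = np.ones(A.shape[1])
    h = np.ones(A.shape[0])
    for _ in range(n_iter):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return a, h

def von_neumann_kernel(A, gamma):
    """K_gamma = sum_n gamma^n K^{n+1} with K = A^T A; needs gamma < 1/rho(K)."""
    K = A.T @ A
    rho = np.linalg.eigvalsh(K).max()   # spectral radius of a PSD matrix
    assert 0 <= gamma < 1.0 / rho, "gamma outside the admissible range"
    return K @ np.linalg.inv(np.eye(K.shape[0]) - gamma * K)
```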
3.3 Interpretation

Equation (6) shows that the von Neumann kernel matrix N_γ(A^T A) is a weighted sum of (A^T A)^n over n = 1, 2, …. As we can interpret [A^T A]_ij as the multiplicity of edges between vertices i and j in the corresponding co-citation graph, [(A^T A)^n]_ij represents the number of paths of length n between vertices i and j in the same graph. Thus, we see that each element of the kernel matrix equals the weighted sum of the numbers of paths between vertices.⁴

Each term (A^T A)^n permits further interpretation as follows. First, it is easy to see that (A^T A)^n with small n captures the degree of proximity between any two vertices. The number [(A^T A)^n]_ij of paths between i and j in the co-citation graph is non-zero only if there is a path of length less than or equal to n between them in the co-citation graph.⁵ Since relatedness is a measure based on proximity, as we argued in Sect. 2.1, [(A^T A)^n]_ij can be interpreted as indicating the relatedness between vertices i and j. In particular, (A^T A)^1 is exactly the co-citation matrix.

⁴ More generally, if A holds arbitrary non-negative elements, [A^T A]_ij represents the sum of the weights of all paths between vertices i and j in the co-citation graph.

⁵ For general graphs, [(A^T A)^n]_ij is non-zero if and only if a path of length exactly n exists between vertices i and j. However, in the co-citation graph induced by A^T A, all vertices with a non-zero in-degree in the original citation graph have a self-loop, and hence there is always a path of length n if there is a path of length less than n between two vertices.

For larger n values, we have the following theorem.

Theorem 1 Let λ > 0 be the dominant eigenvalue of a symmetric positive semidefinite matrix A^T A. If λ is a simple eigenvalue (i.e., has a multiplicity of one), there exists a unit eigenvector v corresponding to λ such that

$$\left(\frac{A^{\mathrm T}A}{\lambda}\right)^{\!n} \to \mathbf{v}\mathbf{v}^{\mathrm T} \quad \text{as } n \to \infty.$$

Proof See Appendix A.

If A^T A is nonnegative, it has a nonnegative dominant eigenvector. This fact, together with the above theorem, leads to the following corollary.

Corollary 1 Let the dominant eigenvalue λ of a nonnegative symmetric positive semidefinite matrix A^T A be nonzero and simple, and let v be a non-negative eigenvector corresponding to λ. For any vertex i with [v]_i ≠ 0 and any vertex pair (j, k) such that [v]_j > [v]_k, there exists an integer m satisfying [(A^T A)^n]_ij > [(A^T A)^n]_ik for all n ≥ m.

Thus, if the co-citation graph (whose adjacency matrix is A^T A) is connected, the dominant eigenvector v of A^T A is equal to the HITS authority vector up to scaling. In this case, Corollary 1 tells us that if we regard a row (or column) vector of (A^T A)^n as a score vector and compute the rankings from the magnitude of its components, the rankings tend to the HITS authority rankings as n → ∞, irrespective of the chosen row (or column). In other words, the number of paths of length n between i and j is an indicator of the importance of these vertices, provided that n is sufficiently large.

To sum up, we see that summing (A^T A)^n over n = 1, 2, … as in (6) can be interpreted as a mixture of relatedness (when n is small) and importance (when n is large). As a special case, the von Neumann kernels subsume co-citation at γ = 0. Near the opposite extreme of the parameter range, i.e., as γ → 1/ρ(A^T A), the ranking induced by the von Neumann kernels is also identical to the HITS importance, as stated by the following theorem.

Theorem 2 Let λ > 0 be the dominant eigenvalue of a symmetric positive semidefinite matrix A^T A. If λ is a simple eigenvalue, there exists a unit eigenvector v corresponding to λ such that

$$\lim_{\gamma \to 1/\lambda}\, (1 - \gamma\lambda)\, N_\gamma(A^{\mathrm T}A) = \lambda\, \mathbf{v}\mathbf{v}^{\mathrm T},$$

where the limit is taken from below.

The proof is again deferred to Appendix A.
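Both limits are easy to check numerically. The following sketch reuses the toy matrix A and the von_neumann_kernel helper from the earlier sketches; the exact normalization of Theorem 2 shown here is our reconstruction, which follows directly from the eigendecomposition of N_γ.

```python
# Numerical check of Theorems 1 and 2 (reusing A and von_neumann_kernel above).
import numpy as np

K = A.T @ A
eigvals, eigvecs = np.linalg.eigh(K)    # ascending eigenvalues
lam = eigvals[-1]                       # dominant eigenvalue
v = eigvecs[:, -1]                      # corresponding unit eigenvector

# Theorem 1: (A^T A / lambda)^n -> v v^T as n grows.
P = np.linalg.matrix_power(K / lam, 50)
print(np.allclose(P, np.outer(v, v), atol=1e-6))

# Theorem 2: (1 - gamma*lambda) N_gamma(A^T A) -> lambda v v^T as gamma -> 1/lambda.
gamma = 0.9999 / lam
Kg = von_neumann_kernel(A, gamma)
print(np.allclose((1 - gamma * lam) * Kg, lam * np.outer(v, v), atol=1e-2))
```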
A similar argument can be made for the relationship between bibliographic coupling, the HITS hub vector (the eigenvector of AA^T), and the von Neumann kernel G_γ = N_γ(AA^T).

3.4 A few notes on application

We conclude this section with a few facts that can help in applying von Neumann kernels in practice.

Parameter sensitivity analysis. The derivative of the von Neumann kernels can be analytically computed at a given point by the following simple equation:

$$\frac{d K_\gamma}{d\gamma} = K_\gamma^2. \qquad\qquad (8)$$

This can be exploited, for instance, for parameter tuning and parameter sensitivity analysis.

Reducing memory requirement. If we are concerned with the importance of vertices relative to a single vertex i rather than the entire kernel matrix, or if the entire kernel matrix cannot be kept in memory, we can reduce the space requirement by summing (γ A^T A)^n e_i over n = 1, 2, … until convergence to obtain the ith column of the von Neumann kernel matrix; recall that e_i is a unit vector with 1 only at the ith component. This way, an iteration consists only of one matrix-vector multiplication similar to the HITS computation, and one vector summation.

4 Topic drift and von Neumann kernels

In this section, we discuss a major shortcoming of von Neumann kernels known as topic drift (Bharat and Henzinger 1998), and propose a solution to this problem.

Topic drift is a phenomenon first observed with HITS. If applied to a graph with multiple communities,⁶ HITS assigns the highest scores to the vertices (documents) in the dominant community of the graph. It follows that search engines using HITS as the back-end may output documents unrelated to the user's query, if the query is different from the topic of the dominant community. Although topic drift was first reported in the context of search engines, which assume that queries are terms, it is not tied to the form of queries; it can be observed even in a pure link analysis setting. In the following example, we assume that the query is posed as a vertex (seed vertex) in a given graph, and the rankings of other vertices relative to this seed vertex are to be computed.

⁶ To the best of our knowledge, there seems to be no consensus on what constitutes a community in the literature. A community in this paper merely means a set of documents in a citation network citing more documents within the community than outside it.

Fig. 1 A citation graph with multiple communities, and the induced co-citation graph

Consider the graph of Fig. 1(a) containing two communities. The HITS authority rankings for the vertices of this graph are (from the most important to the least important) v1 > v2 > v3 > v4 > v5 > v6. Notice that the ranks of the documents in community 2, namely v4, v5 and v6, are uniformly lower than those of the documents in community 1 (v1 and v2).

As von Neumann kernels reduce to HITS as γ → 1/ρ(A^T A), they inherit the topic drift problem from HITS. For convenience, let us rewrite the formula (5) for von Neumann kernels using a normalized parameter α = γ ρ(A^T A), which falls into the range 0 < α < 1 instead of 0 < γ < 1/ρ(A^T A). Letting B = A^T A, we have

$$N_\alpha = \sum_{n=0}^{\infty} \left(\frac{\alpha}{\rho(B)}\right)^{\!n} B^{n+1}. \qquad\qquad (9)$$

Figure 1(b) depicts the co-citation graph induced by the graph of Fig. 1(a). The von Neumann kernel matrix N_0.99 for this graph is shown in (10) below; only the sub-matrix of the kernel matrix for documents v1 through v6 is shown. The remaining rows and columns are for isolated vertices in Fig. 1(b), and thus their elements are constantly 0 regardless of α. The (i, j)-element of this matrix represents the importance of the jth document relative to the ith document.
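Seed-relative rankings of this kind can be reproduced with the following sketch of (9); since the edge list of Fig. 1(a) is not reproduced in this text, the two-community graph below is invented for illustration only.

```python
# A sketch of seed-relative ranking with the normalized kernel N_alpha of (9).
# The toy graph below is made up; it is NOT the graph of Fig. 1(a).
import numpy as np

def n_alpha(B, alpha):
    """N_alpha = sum_n (alpha / rho(B))^n B^{n+1}, in closed form (0 < alpha < 1)."""
    rho = np.linalg.eigvalsh(B).max()
    gamma = alpha / rho
    return B @ np.linalg.inv(np.eye(B.shape[0]) - gamma * B)

A = np.zeros((8, 8))
edges = [(6, 0), (6, 1), (7, 0), (7, 1), (2, 0),   # dominant community: 0, 1
         (2, 3), (4, 3), (5, 3), (5, 4)]           # smaller community: 3, 4
for i, j in edges:
    A[i, j] = 1.0

N = n_alpha(A.T @ A, alpha=0.99)
seed = 5                                # a vertex in the smaller community
print(np.argsort(-N[seed]))             # ranking relative to the seed vertex
```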
For example, relative to document v3 (third row), document v1 is the most important document, because [N_0.99]_{3,1} = 127.64 is the largest value in the third row.

Now let us focus on v6, which belongs to community 2. The ranking for v6 is given by the sixth row of the matrix, and again, v1 has the largest value. However, this ranking for v6 differs from our intuition; because v1 and v6 belong to different communities, it would be more natural if the importance score of v4, a vertex in the same community as v6, were higher than that of v1. If we increase the diffusion factor (e.g., α = 0.999), the ranking relative to each document eventually becomes identical to the HITS authority ranking, determined irrespective of the community to which the document belongs.

To prevent the HITS importance rankings from diverging from the topic (or community) the user is interested in, Kleinberg (1999) proposed to first extract the documents that include the query terms, and apply HITS to the neighbor graph, which is a subgraph of the extracted documents and their neighbors. We may also apply von Neumann kernels to a neighbor graph of documents containing query terms, but this solution requires that the query be terms and that the document contents be accessible. Depending on the type of data (e.g., citation networks, where paper contents are often not available), it is not always possible to meet this requirement. Moreover, Bharat and Henzinger (1998) pointed out that there can be a discrepancy between queries and the topics of high-ranked documents, even if HITS is applied to a neighbor graph.

4.1 Generative model of citations

In this section, we propose a method for reducing topic drift in von Neumann kernels. Although it uses a technique based on Probabilistic Latent Semantic Indexing (PLSI) (Hofmann 1999, 2001), document contents are not required.

PLSI is a method for modeling the generative process of documents. In PLSI, the probability of word w_j occurring in document d_i is modeled by the following equation:

$$p(d_i, w_j) = \sum_{t=1}^{k} p(t)\, p(d_i \mid t)\, p(w_j \mid t), \qquad\qquad (11)$$

where t = 1, 2, …, k represents a latent topic of documents. This model of PLSI, also called the aspect model, assumes that words and documents are conditionally independent given latent topic t. Parameters p(t), p(d_i|t) and p(w_j|t) are estimated from empirical observations with a variant of the Expectation-Maximization algorithm.

Cohn and Chang (2000) modeled the generative process of citations analogously to PLSI. In their model, the probability of a citation is defined by using citations in place of words in (11). Thus, the probability of a document i citing another document j is

$$p(d_i, c_j) = \sum_{t=1}^{k} p(t)\, p(d_i \mid t)\, p(c_j \mid t), \qquad\qquad (12)$$

where d_i represents a citation emanating from document i, and c_j represents a citation to document j. They also proposed to use p(c_j|t), the generative probability of a citation within the community addressing topic t, as the importance of document j in this community. Cohn and Chang tested this method, called Probabilistic HITS, or PHITS for short, on the Cora citation dataset, and demonstrated that it gives intuitive rankings of scientific papers in each inferred community t.

PHITS thus gives the importance rankings in individual communities, but it is not obvious how to compute the other quantities provided by von Neumann kernels, i.e., the relatedness between two documents, and the importance of documents relative to a given document. Below, we exploit the model of (12) in combination with von Neumann kernels to compute these quantities.
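A minimal EM sketch for fitting the aspect model (12) to a citation graph is given below. It is only a sketch: the tempering, random restarts, and convergence checks of a production PLSI/PHITS implementation are omitted, and the function name phits_em is ours.

```python
# EM for the citation aspect model (12): p(d_i, c_j) = sum_t p(t) p(d_i|t) p(c_j|t).
import numpy as np

def phits_em(A, k, n_iter=200, seed=0):
    """Estimate p(t), p(d_i|t), p(c_j|t) and the posteriors p(t | d_i, c_j)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    pt = np.full(k, 1.0 / k)
    pd = rng.random((k, n)); pd /= pd.sum(axis=1, keepdims=True)
    pc = rng.random((k, n)); pc /= pc.sum(axis=1, keepdims=True)
    I, J = np.nonzero(A)               # observed citations (i cites j)
    w = A[I, J]                        # citation weights / multiplicities
    post = None
    for _ in range(n_iter):
        # E-step: posterior p(t | d_i, c_j), one column per observed citation.
        post = pt[:, None] * pd[:, I] * pc[:, J]
        post /= post.sum(axis=0, keepdims=True)
        # M-step: re-estimate the parameters from posterior-weighted counts.
        wpost = w * post                           # shape (k, #citations)
        pt = wpost.sum(axis=1) / wpost.sum()
        pd = np.zeros((k, n)); pc = np.zeros((k, n))
        for t in range(k):
            np.add.at(pd[t], I, wpost[t])
            np.add.at(pc[t], J, wpost[t])
        pd /= pd.sum(axis=1, keepdims=True)
        pc /= pc.sum(axis=1, keepdims=True)
    return pt, pd, pc, (I, J, post)
```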
4.2 Applying von Neumann kernels to community graphs

To alleviate topic drift in von Neumann kernels, we use the generative model of citations given by (12), and apply von Neumann kernels to the weighted graphs induced by the model. Unlike Kleinberg's neighbor graph method, document contents are not used to derive these community graphs. The proposed method maintains the property of von Neumann kernels as a mixture of importance and relatedness, and also takes communities into consideration, even when the kernels are biased towards importance. This method takes the number of communities k as input, and consists of the following four steps (a code sketch of the four steps is given at the end of Sect. 4.3).

Step 1 Fit the citation model (12) to the citation graph, and compute the posterior probability of community t for each citation from d_i to c_j by Bayes' rule:

$$p(t \mid d_i, c_j) = \frac{p(d_i, c_j \mid t)\, p(t)}{\sum_{t'} p(d_i \mid t')\, p(c_j \mid t')\, p(t')} = \frac{p(d_i \mid t)\, p(c_j \mid t)\, p(t)}{\sum_{t'} p(d_i \mid t')\, p(c_j \mid t')\, p(t')}.$$

Step 2 Create each community graph G_t for t = 1, …, k, with the same vertex set as the original citation graph, but assign [A_t]_ij = p(t|d_i, c_j)[A]_ij as the weight of the edge from vertex i to j. Here, A and A_t are the adjacency matrices of the original citation graph and the tth community graph, respectively. The intuition here is to distribute the original edge weights to each community graph t in proportion to the probability p(t|d_i, c_j). Notice that Σ_{t=1}^{k} p(t|d_i, c_j) = 1.

Step 3 For each t, apply the von Neumann kernel to the co-citation matrix A_t^T A_t. The resulting von Neumann kernel for community t is

$$N_\alpha(A_t^{\mathrm T} A_t) = \sum_{n=0}^{\infty} \left(\frac{\alpha}{\rho(A_t^{\mathrm T} A_t)}\right)^{\!n} (A_t^{\mathrm T} A_t)^{n+1}. \qquad\qquad (13)$$

This equation is identical to (6), except that A_t is used in place of the adjacency matrix A of the original citation graph, and the diffusion parameter is rescaled as α/ρ(A_t^T A_t).

Step 4 Finally, sum the von Neumann kernels in (13) over all communities t to obtain

$$N^{\mathrm{comm}}_\alpha = \sum_{t=1}^{k} N_\alpha(A_t^{\mathrm T} A_t). \qquad\qquad (14)$$

This matrix N^comm retains positive semidefiniteness, because the sum of positive semidefinite kernels is still a positive semidefinite kernel (Haussler 1999).

4.3 Illustration

We demonstrate the four steps of our proposed method (Sect. 4.2) with the graph of Fig. 1(a). Let the parameter α for the von Neumann kernel be 0.99, and the hyperparameter k (the number of latent communities) in PLSI be 2.

In Steps 1 and 2, we apply PLSI to Fig. 1(a) and obtain the two community graphs shown in Fig. 2(a) and (b). In Step 3, von Neumann kernels are applied to each community graph. Notice that vertices v1 and v2, the most important vertices in terms of the HITS authority ranking, are not connected to v4, v5 or v6 in the two community graphs. Hence the scores for v1 and v2 relative to the latter vertices are 0 in the respective kernels. In Step 4, we compute (14), the sum of the von Neumann kernel matrices for communities 1 and 2. This yields the final kernel N^comm; at α = 0.99, the resulting kernel matrix is given as (15).

The rankings in the 4th to 6th rows of (15) are significantly different from those of N_0.99 (see (10)). In the 6th row of N_0.99, the highest score (2.90) was assigned to v1, which is a vertex in community 1, despite v6 being in community 2. By contrast, (15) assigns the largest value (42.60) to v4, the vertex with the largest number of in-links in community 2. On the other hand, the ranking relative to vertex v3, located between the two communities, is a mixture of the importance in the two communities by (14).

Fig. 2 Two community graphs induced by the graph of Fig. 1(a). All non-labeled edges have a weight of 1.0
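The code sketch of Steps 1 to 4 promised in Sect. 4.2 follows; it assumes the phits_em and n_alpha helpers from the earlier sketches.

```python
# A sketch of Steps 1-4 of Sect. 4.2 (community-based von Neumann kernel).
import numpy as np

def community_von_neumann(A, k, alpha):
    # Step 1: posteriors p(t | d_i, c_j) from the fitted citation model.
    pt, pd, pc, (I, J, post) = phits_em(A, k)
    N_comm = np.zeros_like(A)
    for t in range(k):
        # Step 2: community graph A_t, edges reweighted by p(t | d_i, c_j).
        At = np.zeros_like(A)
        At[I, J] = post[t] * A[I, J]
        # Step 3: von Neumann kernel on the community co-citation matrix,
        # with the diffusion parameter rescaled inside n_alpha.
        Bt = At.T @ At
        if np.linalg.eigvalsh(Bt).max() > 1e-12:   # skip empty communities
            N_comm += n_alpha(Bt, alpha)           # Step 4: sum over communities
    return N_comm
```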
5 Related kernels and link analysis measures

In this section, we discuss some related kernels on graph vertices and link analysis measures.

5.1 Exponential diffusion kernels

Several authors have proposed to use matrix exponentials instead of the Neumann series to define kernels (Kandola et al. 2003; Kondor and Lafferty 2001; Shawe-Taylor and Cristianini 2004).

Definition 2 Let G be an undirected graph with positive weights, and B its adjacency matrix. The exponential diffusion kernel matrix E_γ on G, with diffusion factor γ ≥ 0, is given by

$$E_\gamma = \sum_{n=0}^{\infty} \frac{\gamma^n}{n!} B^n = \exp(\gamma B). \qquad\qquad (16)$$

Unlike the von Neumann kernels, the series on the right-hand side converges for all γ ≥ 0, and the exponential diffusion kernels are always positive semidefinite. Using a technique similar to the one used in the proof of Theorem 2, we can also show that the rankings induced by the exponential diffusion kernel tend towards the HITS authority ranking in the limit, if G is a co-citation graph.

Theorem 3 Let v be the HITS authority vector of a citation graph A, i.e., the dominant eigenvector of A^T A, and let λ be the corresponding eigenvalue. Then the matrix E_γ(A^T A)/exp(γλ) converges to vv^T as γ → ∞.

Hence, the exponential diffusion kernels are also affected by topic drift, just like the von Neumann kernels. We can consider a community-based variation similar to the community-based von Neumann kernels presented in the previous section, as follows. First, let us rescale γ to α, as we did with the community-based von Neumann kernels:⁷

$$E_\alpha(B) = \sum_{n=0}^{\infty} \frac{1}{n!}\left(\frac{\alpha}{\rho(B)}\right)^{\!n} B^n. \qquad\qquad (17)$$

The community-based exponential diffusion kernel is then

$$E^{\mathrm{comm}}_\alpha = \sum_{t=1}^{k} E_\alpha(A_t^{\mathrm T} A_t). \qquad\qquad (18)$$

As in Sect. 4.2, A_t is the adjacency matrix of community graph t. In Sect. 6.3, we will test this community-based exponential kernel in a paper recommendation task.

⁷ Unlike for von Neumann kernels, however, the rescaled α is not upper-bounded for exponential diffusion kernels. The scaling here is only for consistency with the community-based von Neumann kernels.

5.2 Laplacian-based kernels

The regularized Laplacian (Chebotarev and Shamis 1997; Smola and Kondor 2003; Zhou and Schölkopf 2004) and the heat diffusion kernel (Chung 1997; Kondor and Lafferty 2001) have definitions similar to the von Neumann and exponential diffusion kernels. However, they are defined in terms of the graph Laplacian instead of the raw adjacency matrix. This leads to properties quite different from those of the kernels based on adjacency matrices. For brevity, we only discuss the regularized Laplacian, which, like the von Neumann kernels, is based on the Neumann series.

Definition 3 Let B be the adjacency matrix of an undirected graph G with positive edge weights. The (combinatorial) Laplacian of G is defined as L(B) = D(B) − B, where D(B) is a diagonal matrix with diagonals [D(B)]_ii = Σ_j [B]_ij. The regularized Laplacian is defined in terms of the graph Laplacian as

$$R_\gamma(B) = (I + \gamma\, L(B))^{-1}. \qquad\qquad (19)$$

The right-hand side of (19) is the closed form of the Neumann series Σ_n (−γ L(B))^n. However, this series converges only for γ < 1/ρ(L(B)), whereas R_γ(B) = (I + γ L(B))^{−1} exists for all γ > 0, and whenever it exists, it is positive semidefinite.

In the limit γ → ∞, the regularized Laplacian exhibits a property strikingly different from that of the von Neumann kernels: if the graph represented by adjacency matrix B is connected, R_γ(B) tends to a uniform matrix as γ → ∞; see Theorem 6 in Appendix A. In other words, in the limit γ → ∞, the regularized Laplacian regards all vertices as equally important, or equally related to each other. This contrasts with the von Neumann kernels, which give rankings identical to HITS near the upper bound of the admissible parameter range. In (Ito et al. 2005), we compared the properties of the regularized Laplacian and von Neumann kernels, and argued that the regularized Laplacian remains a relatedness measure throughout its parameter range, unlike the von Neumann kernels.
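For comparison with the von Neumann kernels, the kernels of Definitions 2 and 3, together with the commute-time kernel discussed next, can be sketched as follows (scipy's expm handles the matrix exponential):

```python
# Sketches of the exponential diffusion kernel (16), the regularized
# Laplacian (19), and the commute-time kernel of Sect. 5.2.
import numpy as np
from scipy.linalg import expm

def laplacian(B):
    """L(B) = D(B) - B for an undirected graph with adjacency matrix B."""
    return np.diag(B.sum(axis=1)) - B

def exponential_diffusion_kernel(B, gamma):
    """E_gamma = sum_n (gamma^n / n!) B^n = exp(gamma B); PSD for all gamma >= 0."""
    return expm(gamma * B)

def regularized_laplacian(B, gamma):
    """R_gamma(B) = (I + gamma L(B))^{-1}, cf. (19)."""
    return np.linalg.inv(np.eye(B.shape[0]) + gamma * laplacian(B))

def commute_time_kernel(B):
    """Pseudoinverse of the Laplacian (Saerens et al. 2004)."""
    return np.linalg.pinv(laplacian(B))
```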
Another interesting kernel based on the Laplacian of graphs is Saerens et al.'s commute-time kernel (Saerens et al. 2004; Fouss et al. 2007), defined as the pseudoinverse of the Laplacian matrix. They showed that the distance measure provided by this kernel has a natural interpretation as the commute time, defined as the expected number of steps required for a random walker starting from a given vertex to reach another given vertex for the first time, and then to go back to the starting vertex. Nadler et al. (2006) recently proposed a measure called the diffusion map, which is also related to the graph Laplacian.

5.3 Smola and Kondor's kernel-based regularization framework

Smola and Kondor (2003) pointed out that all the Laplacian-based kernels discussed above can be viewed as a form of regularization of the Laplacian eigenvalues. For example, the regularized Laplacian maps each Laplacian eigenvalue λ to 1/(1 + γλ), whereas the commute-time kernel maps λ to 1/λ for λ ≠ 0 and λ = 0 to 0. They also mention the connection between Laplacian kernels and some importance computation methods including HITS. In this respect, Smola and Kondor's formulation appears quite different from the one presented in this paper. They state that for a given vertex in a regular graph, its HITS score is given by the length of its corresponding vector in the feature space induced by kernels based on the (normalized) Laplacian. In the formulation presented in this paper, by contrast, the HITS importance scores are obtained as an extremum of von Neumann kernels defined in terms of the adjacency matrix instead of the Laplacian. In addition, our formulation takes the components of the kernel matrices directly as the scores indicating relatedness or importance, instead of the distance of vertices in the kernel-induced feature space.

5.4 Relative importance

Relative importance is a new class of link analysis measure recently proposed by White and Smyth (2003). This measure is defined as the importance of vertices in a graph relative to one or more root vertices. The notion of relative importance contrasts with global importance measures such as HITS and PageRank. White and Smyth made a convincing argument that simply applying global importance algorithms to a subgraph surrounding the root vertices does not yield a precise estimate of relative importance, because the root vertices are not given any special preference during importance computation. This can be thought of as one reason topic drift persists in HITS even if Kleinberg's neighbor graph approach is taken. We have discussed topic drift in HITS and its implication for von Neumann kernels in Sect. 4. von Neumann kernels fit naturally as relative importance, and as a bonus clarify the relationship between relative importance and relatedness (namely, the co-citation and bibliographic coupling relatedness), an issue not addressed in White and Smyth's work. With von Neumann kernels, relative importance is formulated explicitly as an intermediate between relatedness and global importance.

5.5 Hofmann's PLSI-based Fisher kernels

Hofmann (2000) used his PLSI model to define a kernel, but in a quite different way from how we use it in Sect. 4. We briefly discuss Hofmann's work and the connection between his method and ours. The Fisher kernel (Jaakkola and Haussler 1999) is a general technique for obtaining a kernel from generative models.
To derive a Fisher kernel from a generative model, we need to compute the Fisher score u(d, θ), the gradient of the log-likelihood function for data d with respect to parameters θ. Given the Fisher score, the Fisher kernel is given by

$$F(d_i, d_j) = u(d_i; \theta)^{\mathrm T}\, I(\theta)^{-1}\, u(d_j; \theta),$$

where I(θ) is the Fisher information matrix, typically approximated by an identity matrix.

Hofmann proposed a Fisher kernel based on PLSI for computing document similarity. The expectation of the log-likelihood in PLSI is given by

$$\overline{\log p(d_i)} = \sum_{j} \hat{p}(w_j \mid d_i)\, \log \sum_{t=1}^{k} p(w_j \mid t)\, p(t \mid d_i),$$

where p(w_j|t) and p̂(w_j|d_i) respectively represent the probability that word w_j is generated from topic t, and the empirical probability of word w_j observed in document d_i. Hofmann computed the derivatives of this log-likelihood function with respect to the parameters ρ_jt = 2√p(w_j|t) and ρ_t = 2√p(t) to yield two types of Fisher kernels. For the first parameters ρ_jt, the derivative is given by

$$\frac{\partial\, \overline{\log p(d_i)}}{\partial \rho_{jt}} = \hat{p}(w_j \mid d_i)\, \frac{p(t \mid d_i, w_j)}{\sqrt{p(w_j \mid t)}},$$

so the resulting Fisher kernel is

$$F(d, d') = \sum_{j} \hat{p}(w_j \mid d)\, \hat{p}(w_j \mid d') \sum_{t=1}^{k} \frac{p(t \mid d, w_j)\, p(t \mid d', w_j)}{p(w_j \mid t)}.$$

Hofmann argued that this is an inner product of empirical distributions (the p̂ factors), but common words contribute only if they occur in the same topical context; the latter statement follows from the rightmost summation of the product of posterior probabilities.

In parallel to PHITS, we can define the Fisher kernels for a citation graph by replacing word w_j with c_j, a citation to document j. The Fisher kernel F based on ρ_jt in this case can be written as a matrix

$$F = \sum_{t=1}^{k} \tilde{A}_t^{\mathrm T} \tilde{A}_t, \qquad\qquad (20)$$

where the (i, j)-element of Ã_t is p̂(c_j|d_i) p(t|d_i, c_j)/√p(c_j|t) if there is a citation from vertex i to j, and 0 otherwise.

As seen from the term Ã_t^T Ã_t in (20), this Fisher kernel essentially computes the sum of (reweighted) co-citation counts for communities t = 1, …, k; recall that p(t|d_i, c_j) is the weight assigned to edge (i, j) in the tth community graph (see Step 2 of Sect. 4.2). A subtle difference between the graphs induced by Ã_t and A_t is that the edge weights in the former are reweighted by p̂(c_j|d_i)/√p(c_j|t). Because the computation in (20) for each community t is identical to co-citation coupling (up to reweighting), the score is zero for any pair of vertices not directly co-cited by other vertices. In contrast, the community-based von Neumann kernel N^comm (see (14)) proposed in the previous section computes the weighted sum of all paths between documents, so it assigns non-zero weights to any document pair as long as they are connected in a community graph; see the discussion on von Neumann kernels in Sect. 3.3.

Hofmann also proposed another Fisher kernel, this time based on the derivative with respect to the parameters ρ_t = 2√p(t). This Fisher kernel is defined as

$$\overline{F}(d, d') = \sum_{t=1}^{k} \frac{p(t \mid d)\, p(t \mid d')}{p(t)},$$

and is the (weighted) inner product in the low-dimensional factor (topic) representation. This kernel is apparently intended to address the above limitation of the kernel F, but it lacks a clear interpretation as a link analysis measure, as it does not involve citations c.
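The matrix form (20) can be sketched as follows, assuming the phits_em helper from Sect. 4; here p̂(c_j|d_i) is approximated by the row-normalized adjacency matrix, which is our assumption rather than a detail stated in the text.

```python
# A sketch of the citation Fisher kernel matrix (20).
import numpy as np

def plsi_fisher_kernel(A, k, eps=1e-12):
    pt, pd, pc, (I, J, post) = phits_em(A, k)
    out_deg = np.maximum(A.sum(axis=1, keepdims=True), eps)
    p_hat = A / out_deg                       # empirical p_hat(c_j | d_i)
    F = np.zeros_like(A)
    for t in range(k):
        At_tilde = np.zeros_like(A)           # [A~_t]_ij, nonzero only on citations
        At_tilde[I, J] = p_hat[I, J] * post[t] / np.sqrt(np.maximum(pc[t, J], eps))
        F += At_tilde.T @ At_tilde            # reweighted co-citation counts
    return F
```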
6 Evaluation with bibliographic citation data

We carried out an empirical evaluation of von Neumann kernels and other graph-based ranking methods, using a citation network of papers on natural language processing (NLP). This citation network was built from scanned bibliographic references in the papers from major NLP journals and conference proceedings. Matching each scanned item against an existing database of NLP literature and filtering out the unmatched items resulted in a citation graph with 2867 vertices (papers) and 6939 edges (citations).⁸

After the co-citation graph was built from the citation graph, the von Neumann and other kernel matrices were computed with MATLAB and GNU Octave from their closed-form representations (e.g., A^T A (I − γ A^T A)^{−1} for von Neumann kernels) using the built-in matrix inversion and eigenvalue functions. Each kernel matrix was treated as a ranking method by taking the ith row vector of the matrix as the score vector for the ith vertex (paper). Given the ith score vector, or the rankings induced thereof, we call the ith vertex the seed vertex (or seed paper) of these rankings.

⁸ The data can be downloaded from http://cl.naist.jp/~shimbo/citationdata.html.

6.1 Plain von Neumann kernels

In Table 1, we present the ranking lists produced by the von Neumann kernels under various diffusion parameters, for a specific seed paper: "Empirical studies in discourse" by M.A. Walker and J.D. Moore, Computational Linguistics 23(1):1-12, 1997. These ranking lists are meant to illustrate the characteristics of the von Neumann kernels under different parameter values.

This specific seed paper was chosen for illustration for the following reason. It is a paper on discourse processing, and presumably, the papers most related to it must be those on discourse processing as well. Discourse processing is not the dominant sub-field of NLP, and as a result, papers on discourse are not ranked among the top 10 in the HITS authority list (see column H in Table 1). Hence we can expect the set of globally authoritative papers to be quite distinct from the set of papers most related to this specific paper on discourse. Consequently, given a ranking list computed for this paper, we can observe whether it is inclined towards (global) importance or (local) relatedness, by comparing the ranking with those of HITS (global importance) and co-citation (local relatedness). Another reason for choosing a discourse paper is that it is relatively easy to tell discourse papers from the rest, because terms unique to discourse are often included in the paper title; these terms are discourse, dialogue, (discourse) structure analysis, and (discourse) segmentation, for example.

Table 1 lists all the 23 papers that are ranked among the top 10 in at least one of the ranking lists shown in the left-hand side of the table. Bold figures indicate the top-10 ranks, with 1 representing the highest rank. The seed paper is boldfaced, and the papers in the same field as the seed, namely discourse processing, are underlined. From the underlined titles and column CC (co-citation coupling), we see that, except for one paper, the papers with nonzero co-citation scores indeed deal with discourse processing. The only exception is the most authoritative paper, the one on a standard NLP dataset (the Penn Treebank). A dash in column CC indicates that the paper was not co-cited with the seed paper.

Instead of the diffusion factor parameter γ introduced in (2)-(7), we present its normalized version defined by α = γλ, where λ is the spectral radius of the co-citation matrix. Since the admissible range of γ is 0 < γ < 1/λ, we have 0 < α < 1. This normalized parameter was also used in Sect. 4 to define the community-based von Neumann kernels.
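For graphs too large for the closed-form inversion used in these experiments, a single kernel column can be accumulated by matrix-vector products alone, along the lines of the memory-saving note in Sect. 3.4; the following is a sketch.

```python
# i-th column of N_gamma(A^T A) = sum_n gamma^n (A^T A)^{n+1} e_i,
# without ever forming the co-citation matrix or the full kernel.
import numpy as np

def kernel_column(A, gamma, i, n_iter=10000, tol=1e-12):
    e = np.zeros(A.shape[1]); e[i] = 1.0
    x = A.T @ (A @ e)                  # (A^T A) e_i via two mat-vec products
    col = x.copy()
    for _ in range(n_iter):
        x = gamma * (A.T @ (A @ x))
        col += x
        if np.linalg.norm(x) < tol:    # converges whenever gamma < 1/rho(A^T A)
            break
    return col
```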
Table 1 Paper rankings relative to the paper "Empirical studies in discourse": HITS (H), co-citation (CC), and von Neumann kernels (α = 0.01, 0.95, 0.9999). The papers listed include:
Building a large annotated corpus of English: the Penn Treebank
A stochastic parts program and noun phrase parser for unrestricted text
Statistical decision-tree models for parsing
Unsupervised word sense disambiguation rivaling supervised methods
Word-sense disambiguation using statistical models of Roget's categories trained
The mathematics of statistical machine translation
Transformation-based error-driven learning and natural language processing
Integrating multiple knowledge sources to disambiguate word sense
Attention, intentions, and the structure of discourse
Multi-paragraph segmentation of expository text
Assessing agreement on classification tasks: the kappa statistic
Lexical cohesion computed by thesaural relations as an indicator of the structure of text
Centering: a framework for modeling the local coherence of discourse
Combining multiple knowledge sources for discourse segmentation
Text segmentation based on similarity between words
A prosodic analysis of discourse segments in direction-giving monologue
The reliability of a dialogue structure coding scheme
Message Understanding Conference tests of discourse processing
Empirical studies in discourse
Effects of variable initiative . . . in human-computer spoken natural language dialogue

Looking at the columns for the von Neumann kernels in Table 1, we see that at α = 0.01, the top-ranked paper is the seed paper itself, followed by six papers that are co-cited with the seed. The high correlation with co-citation counts suggests that at this small value of α, the kernel provides a relatedness measure. However, we already see a symptom of topic drift in the rankings assigned to papers not co-cited with the seed paper: even though there are many discourse papers in the graph, those ranked 8th and 9th are the globally authoritative papers with HITS rankings of 3rd and 4th, respectively ("Statistical decision-tree models for parsing" and "A new statistical parser based on bigram lexical dependencies"). These are papers on parsing, which is the most popular sub-field of NLP. As α is increased, the rankings become more inclined towards that of HITS. At α = 0.9999, the two top-10 lists are mostly identical, and accordingly, the seed paper drops in rank to 95th.

The trend observed above is consistent with the role of α we argued in Sect. 3.3; as α tends to 1, the measure induced by the von Neumann kernel deviates from the co-citation relatedness and approaches the HITS importance. We can show that this trend is not specific to this seed paper. Table 2 lists the average minimizing Kendall (K-min) distance (Fagin et al. 2003) between the top-10 paper lists produced by HITS and the von Neumann kernel for all 2280 seed papers in the largest connected component of the co-citation graph.

Table 2 K-min distance between the von Neumann kernels and HITS, averaged over the 2280 seed vertices in the largest connected component

Our use of the K-min distance as a measure of correlation between rankings follows White and Smyth (2003). K-min is a modification of Kendall's tau applicable to partial rankings, such as top-n ranking lists. As such, K-min is a practical measure more suitable for evaluating recommender systems than Kendall's tau, in that the highest-ranked items are considered more relevant than lower-ranked items.
A small K-min distance means the two ranking lists are similar; it is equal to 0 if all top-10 items are identical, and takes the maximum value of 100 if there are no common items in the top-10 lists.

In Table 2, the average is taken only over the 2280 vertices in the largest connected component of the co-citation graph. These are the vertices to which HITS assigns nonzero scores; the scores for those in all other components are uniformly zero. Theorem 2 guarantees that the rankings produced by von Neumann kernels for these vertices (with nonzero HITS scores) approach that of HITS. Thus, by taking the average only over these vertices, the average K-min distance is expected to vanish as α → 1. As seen from Table 2, this is indeed the case, and the rankings induced by the von Neumann kernels are biased towards the HITS rankings as α is increased. It is worth noting that the K-min distance drops sharply towards 0 near the upper bound of α.

6.2 Community-based von Neumann kernels

We now focus on the community-based von Neumann kernels, i.e., the von Neumann kernels combined with the PLSI-based community decomposition proposed in Sect. 4. In the experiments throughout Sect. 6.2, we set the hyperparameter k (the number of communities) of PLSI to 5.

6.2.1 Comparison of the rankings relative to "Empirical studies in discourse"

Table 3 lists the rankings of the community-based kernels relative to the same seed paper ("Empirical studies in discourse") used in Sect. 6.1, to contrast the community-based von Neumann kernels (see (14)) with the plain von Neumann kernels. The table lists the same papers as Table 1, which shows the ranking lists produced by the plain von Neumann kernels.

At α = 0.01, most of the top-ranked papers are discourse papers, as indicated by the underlined paper titles. Even at α = 0.95, the top-ranked papers still retain the same topic as the seed paper, in contrast to the plain von Neumann kernel at α = 0.95 (see Table 1). The only non-discourse paper in the top-10 list is the most authoritative paper in the entire citation graph (according to HITS), "Building a large annotated corpus of English: the Penn Treebank". This is not a discourse paper, but its inclusion may still be understandable, since it is also co-cited with the seed paper, as seen from column CC. The trend of discourse papers being ranked higher than the rest continues in the rankings at α = 0.9999. Here, the top-10 list is occupied by discourse papers (although the 9th and 10th papers are not shown in the table, to make the comparison with Table 1 easier). In this list, even the Penn Treebank paper has moved out of the top 10. The two globally important papers on parsing discussed in Sect. 6.1, namely "Statistical decision-tree models for parsing" and "A new statistical parser based on bigram lexical dependencies", never enter the top-10 rankings for the community-based von Neumann kernels.

6.2.2 Correlation with HITS

Table 4 shows the K-min distance (averaged over all seed vertices) between the top-10 lists induced by our community-based von Neumann kernels and the global HITS authority rankings. It also lists the average K-min distance between the ranking list of the kernels computed for each individual seed vertex, and the HITS authority rankings computed for the principal community graph of the same seed vertex. Here, the principal community t of a seed vertex d is determined by arg max_{t=1,…,5} p(t|d), i.e., the community t in which p(t|d) is the highest for the vertex d.
The probability p(t|d) is computed with PLSI, which we also used in Step 1 of the kernel computation in Sect. 4.2. The results show that as α is increased, the rankings induced by the community-based von Neumann kernels approach the HITS authority rankings in the principal community of the seed vertices, even though this is not guaranteed; there may be vertices which belong to multiple communities, and for these vertices, the rankings may be a mixture of the HITS rankings in multiple communities. On the other hand, the distance between the community-based von Neumann kernel and plain HITS does not decrease with increased α. This reflects the fact that the rankings for seed vertices whose principal community differs from that of the globally most authoritative vertex do not necessarily converge to the global HITS authority rankings.

6.3 Paper recommendation system simulation

Our next experiment is a simulation of a paper recommendation system. The experiment is designed with the following scenario in mind: a student, who is interested in a research field he has no knowledge of, attempts to carry out a survey of the field. His only clue to the field is a few papers he happens to own, so he inputs these papers as the seeds to the paper recommendation system, expecting to obtain a prioritized list of relevant papers in the field.

To simulate the above scenario, we design our experiment on the basis of several published survey papers and the references cited within them. Each of these survey papers is assumed to cite many of the most relevant papers of the surveyed field. Hence, if we select a few of these cited papers as the seed papers that our fictitious student inputs to a recommendation system, the performance of the systems (which, in our case, are ranking methods such as the von Neumann kernels) can be evaluated in terms of how many of the remaining papers cited by the survey paper are ranked high by the respective systems.

Table 3 Paper rankings relative to the paper "Empirical studies in discourse" by the community-based von Neumann kernels. The papers listed include:
A stochastic parts program and noun phrase parser for unrestricted text
Statistical decision-tree models for parsing
Unsupervised word sense disambiguation rivaling supervised methods
Word-sense disambiguation using statistical models of Roget's categories trained
The mathematics of statistical machine translation
Transformation-based error-driven learning and natural language processing
Integrating multiple knowledge sources to disambiguate word sense
Attention, intentions, and the structure of discourse
Multi-paragraph segmentation of expository text
Lexical cohesion computed by thesaural relations as an indicator of the structure of text
Centering: a framework for modeling the local coherence of discourse
Combining multiple knowledge sources for discourse segmentation
Text segmentation based on similarity between words
A prosodic analysis of discourse segments in direction-giving monologue
The reliability of a dialogue structure coding scheme
Empirical studies in discourse
Effects of variable initiative . . . in human-computer spoken natural language dialogue
6.3.1 Method and evaluation metric

Table 4 Averages of K-min distances between the community-based von Neumann kernels, HITS, and HITS applied to the principal community graph of each seed vertex

α:                                      …     …     0.95   0.99   0.999  0.9999  0.99999
HITS (entire graph)                    93.7  92.9  92.0   85.4   82.4   78.2    78.2
HITS (principal community of seeds)    85.1  83.7  82.2   68.3   60.3   42.2    22.6

The 12 survey papers used include one paper on robust parsing presented at the COLING conference, and one introductory paper on text summarization for the special issue of the Computational Linguistics journal. The surveyed fields of these papers range over various subtopics of NLP, from part-of-speech tagging, parsing, and word-sense disambiguation to text summarization, machine translation, and discourse processing. None of these survey papers appear in the citation graph, and this poses a challenging task.

We repeated the following procedure for each of the 12 survey papers and each ranking method. Let V be the set of all vertices (papers) in the citation graph, and suppose that we are given the ith survey paper, whose citation set is denoted by C_i (⊂ V).

1. Generate all possible combinations of a given size m (m = 1, 2, or 3) from C_i, the set of papers cited by the given survey paper. Let the generated combinations be Comb_m(C_i). These will be used as the seed papers input to the system.

2. For each combination S ∈ Comb_m(C_i), input S to the ranking method. Hence S is the set of seed papers. Obtain the rankings of all remaining papers V \ S in the citation graph.

3. At this point, S̄ = C_i \ S is the set of papers that are referenced by the ith survey paper but not used as seeds (i.e., not input to the ranking method). A good recommendation system (ranking method) is supposed to rank papers in S̄ higher than the rest, because these papers are assumed to have high relevance to the surveyed field (as they are cited by the survey paper).

4. Calculate the number of papers in S̄ = C_i \ S that are included in the top-n rankings, for various n.

The evaluation metric is the micro-averaged recall score, where the average is taken over all possible combinations of seed papers for each of the 12 survey papers. For a ranking method R, let R_n(S) represent the set of papers in the top-n list output by the system R upon input seeds S. The score for this method R, under a fixed seed set size m and restricted to the top-n list (after taking the micro average), is thus

$$\mathrm{recall}_m(R, n) = \frac{\sum_i \sum_{S \in \mathrm{Comb}_m(C_i)} |R_n(S) \cap \bar{S}|}{\sum_i \sum_{S \in \mathrm{Comb}_m(C_i)} |\bar{S}|}. \qquad\qquad (21)$$

Except for co-citation coupling, which may output fewer than n items if there is an insufficient number of co-citations, all tested methods output an equal number (n) of items, and the precision score is proportional to the recall score.

Note that the preliminary experimental results reported in (Shimbo et al. 2007) counted seed papers as well as non-seed papers as correct output, if seed papers appeared in a top-n list. But since seed papers are those input to the system by the user, recommending them back to the user does not actually make sense. Hence, the evaluation in this paper is modified so that the top n papers are selected only from non-seed papers.
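A sketch of the micro-averaged recall (21) follows. Here kernel is any kernel matrix, citation_sets holds the reference sets C_i of the survey papers, and seed sets are scored by summing their kernel rows as described in Sect. 6.3.2; the function and variable names are ours.

```python
# Micro-averaged recall of (21) over all seed combinations (a sketch).
from itertools import combinations
import numpy as np

def micro_recall(kernel, citation_sets, m, n):
    num, den = 0, 0
    for C in citation_sets:                     # C_i: references of survey i
        for S in combinations(C, m):            # seed sets in Comb_m(C_i)
            held_out = set(C) - set(S)          # papers the system should recover
            scores = kernel[list(S)].sum(axis=0)
            scores[list(S)] = -np.inf           # never recommend the seeds back
            top_n = set(np.argsort(-scores)[:n])
            num += len(held_out & top_n)
            den += len(held_out)
    return num / den
```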
6.3.2 Compared algorithms

We computed various kernel matrices for the citation graph, along with two baseline methods (HITS and co-citation rankings) for comparison. For kernel-based methods, if there is only one seed paper i, the rankings are computed from the ith row vector of the kernel matrix, in the same manner as in Sects. 6.1 and 6.2. For the case of two or three seed papers, the rankings are based on the sum of the row vectors of the kernel matrix corresponding to the seeds. The compared algorithms are summarized below, together with their abbreviations.

HITS: HITS authority rankings based on the dominant eigenvector of A^TA. HITS is presented as a baseline for comparison. Notice that HITS, which has no concept of seed vertices, gives a single global ranking list irrespective of the seeds.

CC: Rankings based on the number of co-citations with the seed paper(s). This is another baseline method.

vNK/k: The community-based von Neumann kernel N_γ^comm with k communities introduced in Sect. 4.2; vNK/1 is Kandola et al.'s original von Neumann kernel. We tried the numbers of communities k = 1, 5, 10, 15, 20, and 30; these are regarded as different methods, to see the influence of the number of communities. The diffusion factor was chosen from γ = 0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99, and 0.999, so that the system achieved the best performance in terms of the average recall score over all combinations of seed size m and ranking list size n reported in Table 5.

EDK/k: The exponential diffusion kernels E_β and their community-based variant E_β^comm introduced in Sect. 5.1; EDK/1 is the vanilla exponential diffusion kernel given by (17). For numbers of communities k ≥ 2, the kernels are given by (18). We tried the following values of β: 0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99, 0.999, 1.1, 2, 3, 4, 5, 10, 100, and 1000. The best β was selected in the same manner as for vNK/k, and the numbers of communities tried were k = 1, 5, 10, 15, 20, and 30.

RL: The regularized Laplacian of (19). Similar to the above kernels, the diffusion factor was rescaled as α = ᾱ/ρ, where ρ is the spectral radius of the Laplacian of the co-citation graph. We tested the same values of ᾱ as we used for the exponential diffusion kernels EDK/k.

MFA: The Matrix-Forest-based Algorithm (Chebotarev and Shamis 1997) is defined as (I + L)^{-1}, and hence is in fact a special case of the regularized Laplacian with α = 1 in (19). The spectral radius of the Laplacian of the co-citation graph is 996.92, and thus MFA on this graph is equivalent to the regularized Laplacian with ᾱ = 996.92.

CT: The commute-time kernel (Saerens et al. 2004), i.e., the pseudoinverse of the Laplacian matrix. CT and MFA are known to perform well in a collaborative filtering task (Fouss et al. 2006), which is quite attractive since they have no parameters to tune.
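For reference, the compared kernels can be sketched in a few lines. The constructions below follow the standard definitions (the von Neumann kernel N_γ = A^TA (I − γA^TA)^{-1}, the exponential diffusion kernel E_β = exp(βA^TA), the regularized Laplacian R_α = (I + αL)^{-1}, and the commute-time kernel L⁺); the exact normalizations of (6), (17), and (19), and the rescaling of γ by the spectral radius, are our reading rather than a verbatim transcription, and the community-based variants would apply the same constructions to each community graph.

```python
import numpy as np
from scipy.linalg import expm

def von_neumann_kernel(A, gamma_bar=0.9):
    """N_gamma = B (I - gamma B)^{-1} with B = A^T A; gamma is rescaled as
    gamma = gamma_bar / lam (lam: largest eigenvalue of B), so the defining
    series converges whenever 0 <= gamma_bar < 1 (an assumed normalization)."""
    B = A.T @ A
    lam = np.linalg.eigvalsh(B)[-1]          # spectral radius of the PSD matrix B
    return B @ np.linalg.inv(np.eye(len(B)) - (gamma_bar / lam) * B)

def exponential_diffusion_kernel(A, beta=0.01):
    """E_beta = exp(beta A^T A): a matrix exponential, cf. (17)."""
    return expm(beta * (A.T @ A))

def regularized_laplacian(W, alpha=1.0):
    """R_alpha = (I + alpha L)^{-1} for a symmetric weight matrix W, cf. (19).
    alpha = 1 gives the matrix-forest algorithm (MFA) kernel."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.inv(np.eye(len(W)) + alpha * L)

def commute_time_kernel(W):
    """CT kernel: the Moore-Penrose pseudoinverse of the graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.pinv(L)
```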
6.3.3 Results

Table 5 summarizes the micro-average recall given by (21) for various top-n ranking lists with n = 10, 20, 30, 40, and 50. For each kernel method with a diffusion parameter, only the best results (in terms of macro-average recall over all the reported settings of m and n in the table) over the tested parameter values are shown.⁹ The parenthesized figures to the right of the method names are the chosen parameter values. See Sect. 6.3.2 for the abbreviations of the methods.

⁹ In practice, parameters must be chosen via cross-validation, but since this is extremely time-consuming in our setting, the results reported here are for the best parameter values.

[Table 5: Average recall scores for various top-n lists and seed sizes m; bold figures indicate the best score in each (n, m) combination.]

The following can be observed from the table.

- When no community decomposition was performed (i.e., k = 1), the kernels based on the adjacency matrix (vNK/1 and EDK/1) performed better than the regularized Laplacian (RL) when m = 1, but they were about even when m = 2, and RL was better when m = 3. In fact, with seed size m = 3, even the co-citation ranking (CC) performed quite well.
- The best recall scores were obtained by the community-based von Neumann kernel vNK/k, except in one situation (m = 3 and n = 30) in which the regularized Laplacian (RL) was best. In most cases, the optimal number of communities for vNK/k was k = 15.
- The exponential diffusion kernels EDK/k showed performance close to that of the von Neumann kernels vNK/k. Their performance was equal when no community decomposition was performed (i.e., EDK/1 and vNK/1). For the exponential diffusion kernels, however, community decomposition was not as effective as it was for the von Neumann kernels.
- Even co-citation performed moderately well, especially when m = 3.
- The non-parametric Laplacian-based kernels, namely the commute-time kernel (CT) and MFA, did not perform well in this experiment. The poor performance of MFA can be attributed to the large parameter value ᾱ = 996.92 when MFA is viewed as an instance of the regularized Laplacian. We observed that the commute-time kernel also output rankings similar to those of MFA.

On the other hand, these kernels, MFA and the commute-time kernel, are known to perform quite well in many collaborative filtering tasks (Fouss et al. 2006). What characterizes this discrepancy, as well as which of the adjacency-matrix-based or the Laplacian-based kernels should be chosen for a given task, is an interesting question for future research. One conceivable factor is a bias towards importance rankings in the adjacency-matrix-based kernels. The paper recommendation task requires ranking methods that choose important papers (in each individual research field), and this may be the reason the adjacency-matrix-based kernels worked better than the Laplacian-based kernels in this task. Indeed, Fouss et al. (2007) reported a negative correlation between the commute-time kernels and the best-seller (i.e., importance) rankings in their experiments.

Finally, Table 6 summarizes the results of the Wilcoxon signed-rank test (Siegel and Castellan 1988) on macro recall scores over the 12 survey papers, with the null hypothesis H0 that there is no performance difference between vNK/15 (0.9) (i.e., the community-based von Neumann kernel with k = 15 and γ = 0.9) and each of the other ranking methods, and the alternative hypothesis H1 that vNK/15 outperforms that method. Only values with significance level p < 0.05 are shown. The table is for seed set size m = 1 only, since for m = 2 and m = 3 the seed sets overlap. For results on the significance between all the methods involved, not just vNK/15, see Appendix B.

[Table 6: p-values of the Wilcoxon signed-rank test for recall scores over the 12 survey papers (single-tailed; in favor of vNK/15); m = 1 (one seed); only values p < 0.05 are shown.]

The results of Table 6 show that vNK/15 significantly outperformed HITS, MFA, and CT on all tested n; CC (co-citation) on all but n = 10; and RL (regularized Laplacian) on n = 10 and 20.
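Tests of this kind are routine to run with standard tools; for instance, with SciPy, the single-tailed Wilcoxon signed-rank test and the paired sign test discussed next can be carried out as follows. The score vectors and trial counts here are made-up placeholders, not values from our experiments.

```python
from scipy.stats import binomtest, wilcoxon

# Hypothetical per-survey macro recall scores for two methods (12 surveys).
vnk15 = [0.41, 0.38, 0.52, 0.47, 0.33, 0.45, 0.50, 0.36, 0.44, 0.39, 0.48, 0.42]
hits  = [0.30, 0.28, 0.45, 0.40, 0.25, 0.41, 0.43, 0.30, 0.37, 0.33, 0.40, 0.36]

# Single-tailed test: H1 says the first method's scores are greater.
stat, p = wilcoxon(vnk15, hits, alternative="greater")
print(f"Wilcoxon: W = {stat}, p = {p:.5f}")

# Paired sign test: n_plus "+" trials out of n_plus + n_minus discordant trials.
n_plus, n_minus = 118, 74                    # hypothetical counts
print("sign test:", binomtest(n_plus, n_plus + n_minus, alternative="greater").pvalue)
```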
The test did not show statistical significance of vNK/15 (the community-based von Neumann kernel) over vNK/1 (the plain von Neumann kernel), even though the obtained micro-average recalls were consistently higher for vNK/15 when the seed size was m = 1. We therefore also conducted a paired sign test, interpreting each of the results for all possible seed-answer pairs (S, c) as a trial, where S is a singleton seed set (since this test is again for seed size m = 1) and c ∈ C_i \ S is one of the non-seed papers cited in the survey paper. We counted a "+" if vNK/15 correctly included c in its top-n list but vNK/1 did not, and a "−" if vNK/1 listed c in its top-n list but vNK/15 did not. The sign test showed that the results are statistically significant in favor of vNK/15 for all tested n at the p < 0.01 level.

7 Conclusions

In this paper, we have analyzed the characteristics of the von Neumann kernels. The major contributions of this paper are as follows. (i) We have clarified the relationship between von Neumann kernels and HITS. (ii) We have pointed out that the von Neumann kernels are prone to topic drift, and that the community structure of the graph must be taken into account prior to the application of the kernels. To this end, we have proposed building a community graph corresponding to each community in the graph, using a PLSI-based graph decomposition. This decomposition borrows an idea from Cohn and Chang's PHITS method (Cohn and Chang 2000) and does not require that the contents of documents be available. The resulting community graphs can be effectively combined with von Neumann kernels to obtain various ranking methods that respect the community structure underlying the graph. (iii) We have presented experimental results demonstrating the effectiveness of von Neumann kernels in a simulated recommendation task.

Although most of the previous work on kernel-based link analysis uses Laplacian-based kernels (Fouss et al. 2007; Saerens et al. 2004; Smola and Kondor 2003; Zhou and Schölkopf 2004), the unified framework of importance and relatedness provided by von Neumann kernels is also attractive for many link analysis applications. In particular, we believe that von Neumann kernels have great potential for recommendation systems. Consider the paper recommendation task we simulated in Sect. 6.3. In this scenario, the degree of the user's acquaintance with the research field affects how much the system should bias its decisions towards authoritative papers in the field. A researcher who is an expert in the field would not be happy if the system recommended the most authoritative paper in the field, since he is likely to know the paper already; in this case, it would be reasonable to bias the ranking method towards relatedness measures. On the other hand, recommending the most well-known papers should be acceptable if the user is not familiar with the field. Such an adjustment can easily be made with the single parameter γ of the von Neumann kernels.

In future work, we will apply the community-based von Neumann kernels to other data sets. A more challenging problem is the automatic determination of the number of communities in the community-based von Neumann and exponential diffusion kernels. A simple approach is to use cross-validation. In the clustering literature, eigengaps of the Laplacian matrix have been used to deduce the appropriate number of clusters (Zelnik-Manor and Perona 2005). We plan to check whether a similar approach is effective for the PLSI-based graph decomposition we used for the community-based von Neumann kernels.
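For concreteness, the eigengap heuristic mentioned above can be sketched as follows; whether it transfers from spectral clustering to our PLSI-based decomposition is exactly the open question. The function is a hypothetical adaptation in which W is a symmetric weight matrix (e.g., the co-citation matrix).

```python
import numpy as np

def eigengap_num_communities(W, k_max=30):
    """Suggest a community count from the largest gap among the smallest
    Laplacian eigenvalues (in the spirit of Zelnik-Manor and Perona 2005)."""
    L = np.diag(W.sum(axis=1)) - W
    mu = np.linalg.eigvalsh(L)[:k_max + 1]   # smallest eigenvalues, ascending
    gaps = np.diff(mu)                       # mu_{k+1} - mu_k for k = 1..k_max
    return int(np.argmax(gaps)) + 1          # the k with the largest gap
```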
Automatic tuning of a suitable diffusion factor is another issue in the practical application of parametric kernel-based ranking methods. In this respect, Acharyya and Ghosh (2003) pursued an automatic parameter tuning method for their parametric vertex importance measure, which is similar to the von Neumann kernels. Their optimality criterion was the stability of the parameters, but since this is only one of many possible criteria for optimality, we believe there is much room for further research in this area. Cristianini et al.'s work on kernel alignment (Cristianini et al. 2002; Kandola et al. 2002) is a more goal-directed approach that optimizes the parameter toward an objective matrix. It is applicable to von Neumann kernels in the presence of such a matrix and looks promising.

Acknowledgements We are grateful to Eric Nichols for proofreading, and to Taku Kudo for discussion in the early stage of this work. We would also like to thank the three anonymous reviewers for detailed and helpful comments that greatly improved the paper.

Appendix A: Proofs

This appendix provides the proofs of the theorems that appeared in Sect. 3.3. We also present a theorem needed for the discussion of Sect. 5.2.

Theorem 4 (Theorem 1) Let λ > 0 be the dominant eigenvalue of a symmetric positive semidefinite matrix A^TA. If λ is a simple eigenvalue (i.e., has a multiplicity of one), there exists a unit eigenvector v corresponding to λ such that (A^TA/λ)^n → vv^T as n → ∞.

Proof Let A^TA = Σ_{i=1}^m λ_i v_i v_i^T be the spectral decomposition of A^TA, with λ = λ_1 > 0 and v = v_1 being the corresponding unit eigenvector. Raising both sides to the power n and dividing by λ^n, we have, for each n = 1, 2, …,

  (A^TA/λ)^n = Σ_{i=1}^m (λ_i/λ)^n v_i v_i^T = v_1 v_1^T + Σ_{i=2}^m (λ_i/λ)^n v_i v_i^T.

Because λ = λ_1 > λ_i ≥ 0 for every i = 2, …, m, the second term on the right-hand side vanishes as n → ∞. □

Theorem 5 (Theorem 2) Let λ > 0 be the dominant eigenvalue of a symmetric positive semidefinite matrix A^TA. If λ is a simple eigenvalue, there exists a unit eigenvector v corresponding to λ such that

  lim_{γ → 1/λ} ((1 − γλ)/λ) N_γ(A^TA) = vv^T,

where the limit is taken from below.

Proof Let A^TA = Σ_{i=1}^m λ_i v_i v_i^T be the spectral decomposition of A^TA, with dominant eigenpair λ = λ_1 and v = v_1, and suppose γ < 1/λ_1. By the infinite series representation of the von Neumann kernels (see (6)), we have

  N_γ(A^TA) = Σ_{n=1}^∞ γ^{n−1} (A^TA)^n = Σ_{i=1}^m (λ_i / (1 − γλ_i)) v_i v_i^T,

and hence

  ((1 − γλ_1)/λ_1) N_γ(A^TA) = v_1 v_1^T + Σ_{i=2}^m [λ_i(1 − γλ_1) / (λ_1(1 − γλ_i))] v_i v_i^T.

As γ → 1/λ_1 from below, 1 − γλ_1 → 0 while 1 − γλ_i → 1 − λ_i/λ_1 > 0 for i = 2, …, m, so the second term on the right-hand side vanishes. □

The following theorem is for the discussion of Sect. 5.2.

Theorem 6 Consider an undirected connected graph with positive edge weights, and let B be its adjacency matrix. Then the regularized Laplacian R_α(B) given by (19) tends to a uniform matrix as α → ∞.

Proof Let m be the number of vertices in the graph, and let {(μ_i, v_i)}_{i=1}^m be the eigenpairs of L(B). As we have assumed the graph represented by B is connected, first note the well-known fact that the Laplacian matrix has an eigenpair (μ_1, v_1) = (0, (1/√m)1), where 1 is the all-ones vector, and all other eigenvalues μ_i (i ≠ 1) are nonzero. Further, let L(B) = PΛP^{−1} be the spectral decomposition of L(B), with P = [v_1, …, v_m] an orthogonal matrix whose columns v_i are unit eigenvectors corresponding to the μ_i, and Λ a diagonal matrix with the eigenvalues as diagonal elements, [Λ]_{i,i} = μ_i. It follows that

  I + αL(B) = P I P^{−1} + αPΛP^{−1} = P(I + αΛ)P^{−1},

and hence R_α(B) = (I + αL(B))^{−1} = P(I + αΛ)^{−1}P^{−1}. Here, (I + αΛ)^{−1} is a diagonal matrix with diagonal elements [(I + αΛ)^{−1}]_{i,i} = 1/(1 + αμ_i). In particular, since μ_1 = 0, we have [(I + αΛ)^{−1}]_{1,1} = 1, a constant, whereas 1/(1 + αμ_i) → 0 as α → ∞ for every i ≠ 1. Taking the limit therefore yields

  lim_{α→∞} R_α(B) = v_1 v_1^T = (1/m) 1 1^T,

a uniform matrix whose entries are all 1/m. □
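Theorem 6 is also easy to check numerically: the sketch below builds the regularized Laplacian of a small, arbitrarily chosen connected weighted graph and shows that its entries approach the uniform value 1/m as α grows.

```python
import numpy as np

# A small connected weighted graph and its Laplacian.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 0.],
              [1., 1., 0., 3.],
              [0., 0., 3., 0.]])
L = np.diag(W.sum(axis=1)) - W
m = len(W)

for alpha in (1.0, 1e2, 1e4, 1e6):
    R = np.linalg.inv(np.eye(m) + alpha * L)
    # Maximum entrywise deviation from the uniform matrix (1/m) 11^T.
    print(f"alpha = {alpha:>9}: max deviation = {np.abs(R - 1/m).max():.2e}")
```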
Appendix B: Full results of the Wilcoxon signed-rank test

Table 7 shows the full results of the Wilcoxon signed-rank test discussed in Sect. 6.3.3.

[Table 7: Full results of the Wilcoxon signed-rank test for recall scores on top-n lists (p-values; single-tailed; in favor of the method on the left over the one at the top); only values p < 0.05 are shown. Sub-tables: (a) top-10 lists, (b) top-20 lists, (c) top-30 lists, (d) top-40 lists, and (e) top-50 lists; the compared methods include vNK/1, vNK/15, EDK/1, and EDK/15.]


