On the Metric Distortion of Embedding Persistence Diagrams into Separable Hilbert Spaces

Mathieu Carrière
Department of Systems Biology, Columbia University, New York, USA

Ulrich Bauer
Department of Mathematics, Technical University of Munich (TUM), Germany

Abstract
Persistence diagrams are important descriptors in Topological Data Analysis. Due to the nonlinearity of the space of persistence diagrams equipped with their diagram distances, most of the recent attempts at using persistence diagrams in machine learning have been done through kernel methods, i.e., embeddings of persistence diagrams into Reproducing Kernel Hilbert Spaces, in which all computations can be performed easily. Since persistence diagrams enjoy theoretical stability guarantees for the diagram distances, the metric properties of the feature map, i.e., the relationship between the Hilbert distance and the diagram distances, are of central interest for understanding if the persistence diagram guarantees carry over to the embedding. In this article, we study the possibility of embedding persistence diagrams into separable Hilbert spaces with bi-Lipschitz maps. In particular, we show that for several stable embeddings into infinite-dimensional Hilbert spaces defined in the literature, any lower bound must depend on the cardinalities of the persistence diagrams, and that when the Hilbert space is finite dimensional, finding a bi-Lipschitz embedding is impossible, even when restricting the persistence diagrams to have bounded cardinalities.

2012 ACM Subject Classification Mathematics of computing → Algebraic topology
Keywords and phrases Topological Data Analysis, Persistence Diagrams, Hilbert space embedding
Digital Object Identifier 10.4230/LIPIcs.SoCG.2019.21

1 Introduction
The increase of available data in both academia and industry has been exponential over the past few decades, making data analysis and machine learning ubiquitous in many different fields of science.
© Mathieu Carrière and Ulrich Bauer; licensed under Creative Commons License CC-BY
35th International Symposium on Computational Geometry (SoCG 2019).
Editors: Gill Barequet and Yusu Wang; Article No. 21; pp. 21:1–21:15
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany

Topological Data Analysis (TDA) [5] is one specific field of data science, which focuses on complex rather than big data. The general assumption of TDA is that data is actually sampled from geometric or low-dimensional domains, whose topological features are relevant to the analysis. These topological features are usually encoded in a mathematical object called a persistence diagram, which is roughly a set of points in the plane, each point representing a topological feature whose size is contained in the coordinates of the point. Persistence diagrams have been shown to bring complementary information to other traditional descriptors in many different applications, often leading to substantially improved results. This is also due to the stability properties of persistence diagrams, which state that persistence diagrams computed on similar data are also very close in the diagram distances [2, 8, 9].

Unfortunately, the use of persistence diagrams in machine learning methods is not straightforward, since many algorithms expect data to be Euclidean vectors, while persistence diagrams are sets of points with possibly different cardinalities. Moreover, the diagram distances used to compare persistence diagrams are computed by means of optimal matchings, and thus are quite different from Euclidean metrics. The usual way to cope with such difficult data is to use kernel methods.
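To make the optimal-matching formulation of the diagram distances concrete, the following is a minimal sketch of a 1-Wasserstein diagram distance between two small diagrams. It rests on assumptions not stated in the text above: points are (birth, death) pairs, the ground metric is the ℓ∞ norm, unmatched points are sent to the diagonal at cost (death − birth)/2, and the function name `wasserstein_1` is hypothetical. The matching itself is solved with SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def wasserstein_1(diag_a, diag_b):
    """1-Wasserstein diagram distance (illustrative sketch, l-infinity ground metric)."""
    a = np.asarray(diag_a, dtype=float).reshape(-1, 2)
    b = np.asarray(diag_b, dtype=float).reshape(-1, 2)
    n, m = len(a), len(b)

    # Square cost matrix: rows are the n points of a plus m diagonal slots,
    # columns are the m points of b plus n diagonal slots.
    cost = np.zeros((n + m, n + m))

    # Matching a point of a with a point of b costs their l-infinity distance.
    cost[:n, :m] = np.abs(a[:, None, :] - b[None, :, :]).max(axis=2)

    # Matching a point with the diagonal costs half its persistence.
    cost[:n, m:] = ((a[:, 1] - a[:, 0]) / 2.0)[:, None]
    cost[n:, :m] = ((b[:, 1] - b[:, 0]) / 2.0)[None, :]

    # Diagonal-to-diagonal matches (bottom-right block) stay at zero cost.
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())
```

The diagonal slots are what allow diagrams with different cardinalities to be compared: a point without a good partner is matched to the diagonal instead. For example, `wasserstein_1([(0, 4)], [(0, 5)])` matches the two points directly, while `wasserstein_1([(0, 1)], [(10, 20)])` sends both points to the diagonal because that is cheaper.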
A kernel is a symmetric function on the data whose evaluation on a pair of data points equals the scalar product of the images of these points under a feature map into a Hilbert space, called the Reproducing Kernel Hilbert Space of the kernel. Many algorithms can be kernelized, such as PCA and SVM, allowing one to handle non-Euclidean data as soon as either a kernel or a feature map is available. Hence, the question of defining a feature map into a Hilbert space has been intensively studied in the past few years, and, as of today, various methods have been proposed and implemented, mapping into either finite or infinite dimensional Hilbert spaces [4, 7, 21, 16, 1, 6, 13].

Since persistence diagrams enjoy stability properties, it is also natural to ask for the same guarantee for their embeddings. Indeed, various feature maps defined in the literature satisfy a stability property stating that the Hilbert distance between the images of the persistence diagrams is upper bounded by some specific diagram distance, most commonly the 1-Wasserstein diagram distance. In many cases, this upper bound applies only to a restricted set of persistence diagrams with bounded number and bounded range of persistence pairs, and these bounds enter the constant in the stability estimate. However, some unconditional stability results exist as well, e.g., for the Persistence Scale Space feature map [21].

A more difficult question is whether a lower bound also holds. As a first step in this direction, a lower bound for the Sliced Wasserstein distance was proved in [6], showing that this metric is equivalent to the first diagram distance. Moreover, since the Sliced Wasserstein distance is conditionally negative definite, a Gaussian kernel can be defined from it using Berg's theorem [3]. However, even in this case, the resulting Sliced Wasserstein kernel distance is not equivalent to the Sliced Wasserstein distance, and so the corresponding feature map is not guaranteed to be bi-Lipschitz.
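The non-equivalence between a Gaussian kernel distance and the distance it is built from can be checked with a short computation. The sketch below assumes a Gaussian kernel of the form k(x, y) = exp(−d(x, y) / (2σ²)) for some conditionally negative definite distance d; the function names and the default bandwidth `sigma` are hypothetical, and d is passed in directly as a number rather than computed from diagrams. Since k(x, x) = 1, the squared Hilbert distance between the images of x and y is k(x, x) + k(y, y) − 2k(x, y) = 2 − 2 exp(−d / (2σ²)).

```python
import math


def gaussian_kernel(d, sigma=1.0):
    """Gaussian kernel value for two points at distance d (assumed form)."""
    return math.exp(-d / (2.0 * sigma ** 2))


def kernel_distance(d, sigma=1.0):
    """Hilbert distance between the feature-map images of two points at distance d."""
    # ||Phi(x) - Phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y), with k(x, x) = 1.
    return math.sqrt(2.0 - 2.0 * gaussian_kernel(d, sigma))


def distortion_ratio(d, sigma=1.0):
    """Ratio of the original distance to the kernel distance (requires d > 0)."""
    return d / kernel_distance(d, sigma)
```

The kernel distance never exceeds √2, whereas d can be arbitrarily large, so `distortion_ratio` grows without bound as d increases. No constant c > 0 can therefore satisfy c · d ≤ kernel distance for all pairs, which is the lower bound a bi-Lipschitz feature map would require.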
Thus, the question remained open in general.

Contributions
In this article, we consider the general question of the existence of bi-Lipschitz embeddings of persistence diagrams into separable Hilbert spaces. More precisely, we show the following results:
- For several stable feature maps defined in the literature, if such a bi-Lipschitz embedding exists for persistence diagrams with bounded number and range of points, then the ratio between upper and lower bound goes to ∞ as the bounds on the number of points in the persistence diagrams and on their range increase to ∞ (Theorem 3.5 and Proposition 3.9).
- Such a bi-Lipschitz embedding does not exist if the Hilbert space is finite dimensional (Theorem 4.4).
Finally, we also provide experimental evidence of this behavior by computing the metric distortions of various feature maps for persistence diagrams with increasing cardinalities.

Related work
Feature maps for persistence diagrams can be classified into two different classes, depending on whether the corresponding Hilbert space has finite or infinite dimension. In the infinite dimensional case, the first attempt was that proposed in [4], in which persistence diagrams are turned (...truncated)