Cover Learning for Topological Inference
The standard approach to Topological Inference is based on geometric complexes. Unfortunately, geometric complexes typically scale cubically (and often much worse) in the number of data points, which is a serious obstacle in practice. Here I describe an alternative approach to Topological Inference.
For motivation and an intro to simplicial complexes, you can check my previous post on Topological Inference and Unsupervised Learning. The approach described in this post appears in our recent paper [Scoccola et al. ICML]; for more details and references, it’s best to check there.
Geometric Complexes
Standard examples of geometric complexes include the Rips complex, the Čech complex, and the Alpha complex. All of these are filtered (or weighted) simplicial complexes built from an input metric space (assumed to be a subset of $\mathbb{R}^n$ in the latter two cases).
Why geometric complexes? Two main results make them particularly appealing from a theoretical point of view. First, the geometric complexes of metric spaces that are similar (in the Gromov-Hausdorff sense) are necessarily similar [Chazal et al. Geom. Dedicata]. Second, the geometric complex of a sufficiently well-behaved metric space (such as a compact manifold) encodes the topology of the space itself [Hausmann. Prospects in Topology]. Together, these two results imply a consistency result for geometric complexes: when applied to a sufficiently good sample of a sufficiently well-behaved space, they encode the topology of the space, and can thus be used for Topological Inference [Boissonnat et al. Cambridge Texts Appl. Math.].
Complexity and sparsification. Unfortunately, geometric complexes are huge. For example, to compute the homology of the Rips complex of a finite metric space $X$ up to dimension $m$, one needs to construct a simplicial complex with $\Theta(|X|^{m+2})$ simplices, and then perform Gaussian elimination on a matrix of that size!
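To make the blowup concrete, here is a minimal, purely illustrative sketch that enumerates Rips simplices by brute force. Dedicated software is much cleverer about the enumeration, but the asymptotics are the same; the function name and parameters are just made up for this illustration.

```python
from itertools import combinations
import numpy as np

def rips_simplices(points, radius, max_dim):
    """Naively enumerate the simplices of the Vietoris-Rips complex:
    a subset of points spans a simplex iff all of its pairwise
    distances are at most `radius`."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = []
    for dim in range(max_dim + 1):
        for idx in combinations(range(n), dim + 1):
            if all(dist[i, j] <= radius for i, j in combinations(idx, 2)):
                simplices.append(idx)
    return simplices

# To compute homology up to dimension m we need simplices of dimension m+1,
# i.e. subsets of m+2 points, of which there are on the order of |X|^(m+2).
points = np.random.default_rng(0).normal(size=(30, 2))
print(len(rips_simplices(points, radius=0.8, max_dim=2)))
```

Already with a few hundred points and $m = 2$ the enumeration alone becomes hopeless, which is essentially the $\Theta(|X|^{m+2})$ bound above.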
Because of this, many sparsification techniques for geometric complexes (and simplicial complexes in general) have been proposed. While these do improve computation time and memory to an extent, scaling remains well above quadratic, and it is rare to be able to compute the homology of a high-dimensional point cloud with more than a few thousand points.
The problem. I believe the main issue is that geometric complexes always take the data points as their vertices: any simplicial complex whose vertex set is the data itself must grow with the number of data points, no matter how aggressively it is sparsified.
Covers and Nerves
Luckily, there is a great (and classical) source of simplicial complexes that does not use points as vertices: the nerve complex of a cover.
A cover of a set $X$ is a collection of subsets of $X$ whose union is $X$. The nerve of such a cover is the simplicial complex whose vertices are the subsets in the cover, and in which a collection of subsets $U_{i_0}, \dots, U_{i_k}$ spans a $k$-simplex precisely when the intersection $U_{i_0} \cap \cdots \cap U_{i_k}$ is non-empty.
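In code, the nerve is straightforward to compute once the cover is given as a list of sets. Here is a small sketch (enumerating simplices only up to a chosen dimension); the function name and parameters are just illustrative.

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Nerve of a cover given as a list of sets: the cover elements are the
    vertices, and a collection of elements spans a simplex iff their
    intersection is non-empty."""
    simplices = []
    for dim in range(max_dim + 1):
        for idx in combinations(range(len(cover)), dim + 1):
            if set.intersection(*(cover[i] for i in idx)):
                simplices.append(idx)
    return simplices

# A toy cover of {0,...,9} by three overlapping subsets.
cover = [set(range(0, 5)), set(range(3, 8)), set(range(6, 10))]
print(nerve(cover))  # three vertices, edges (0,1) and (1,2), no triangle
```

Note that the size of the nerve is controlled by the number of cover elements, not by the size of $X$.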
Cover Learning
In the context of Topological Inference, the nerve construction allows us to reduce the problem of constructing a simplicial complex to that of constructing a cover of the data. If we construct a cover with $k$ elements, the output simplicial complex will have $k$ vertices, and if we have control over $k$, we can make it much smaller than the number of data points. How to learn covers is an interesting problem in its own right, and is related to soft clustering.
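To fix ideas, here is a naive baseline for the data-to-cover-to-nerve pipeline. This is emphatically not the method from our paper: it just builds a cover by running k-means and letting clusters overlap slightly, then takes the nerve (reusing the `nerve` helper from the previous snippet). The `overlap` parameter and function name are made up for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cover_from_kmeans(X, k, overlap=1.2):
    """A simple baseline cover: run k-means and assign each point to every
    center whose distance is within `overlap` times the distance to its
    nearest center. The overlap factor makes cover elements intersect, so
    the nerve can have edges and higher-dimensional simplices."""
    centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    nearest = dists.min(axis=1, keepdims=True)
    membership = dists <= overlap * nearest
    return [set(np.flatnonzero(membership[:, j])) for j in range(k)]

# Sample a noisy circle with 400 points and cover it with k = 8 elements.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 400)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(400, 2))
cover = cover_from_kmeans(X, k=8)
print(nerve(cover, max_dim=2))  # `nerve` as defined in the previous snippet
```

For data sampled from a circle, and for a reasonable choice of $k$ and overlap, the resulting nerve should be a cycle on $k$ vertices, recovering the correct first homology with far fewer vertices than data points.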
In our paper, we approach it from the perspective of geometric optimization, by minimizing a certain loss function for covers that we design. Here’s an example from the paper:
Some pretty cool tricks go into our approach, but what I want to emphasize here is not our particular method (which, at the end of the day, is a heuristic), but rather that:
Covers should be a central tool in Topological Inference.
I’ll conclude with some open questions:
- Are there practical cover learning algorithms that are consistent even if one fixes the number of cover elements $k$? That is, I would like a practical cover learning algorithm whose output nerve is topologically correct in the limit of infinitely many sample points, while being restricted to have no more than $k$ vertices (with $k$ depending only on the space the data is sampled from). This would stand in stark contrast to the consistency of geometric complexes, whose size grows without bound as the sample size grows.
- Are there approaches to cover learning that are simpler and more robust than the one we propose?
- What is the relationship between fuzzy covers and fuzzy clusterings? Are any of the standard fuzzy clustering algorithms topologically consistent? (What a fuzzy cover is, and why one would consider one, is explained in Section 3 of our paper.)