Mapper [1] is a topological data analysis procedure that infers a simplicial complex of a topological space $X$ from a point cloud sampled from the space. The simplicial complex can reveal various topological or geometric properties of $X$, such as the number of connected components, the number of holes, and flares, either by visualization or by computational algebraic tools. In this note, I will describe my understanding of the theory behind the method.
A typical output of Mapper on a point cloud is a graph like the plot below. The procedure is statistical by nature, but it tries to mimic a rigorous topological construction described in [1] and [2]. I will first outline the topological construction, then describe the statistical one that is Mapper.
A 1D simplicial complex returned by Mapper: each node represents a region that is estimated to be homotopy equivalent to a point (contractible); the number of observed data points in that region is shown by the number on the node and the size of the node; and the colour of the node corresponds to the average value of the function $f$ over points in that region ($f$ is explained below).
Ideas behind the topological construction
Given a finite open covering of $X$ (a finite collection of open subsets of $X$ whose union is the whole space $X$), we can form a simplicial complex called the nerve of the covering: each set in the covering is a vertex, and a collection of sets spans a simplex whenever their common intersection is non-empty. So the key is to come up with a finite open covering of $X$, but this might be hard if $X$ is very exotic. Instead, if we have a continuous function $f:X \rightarrow Z$, where $Z$ is a more familiar space such as $\mathbb{R}^n$, then an open covering $\mathit{U} = \{\mathit{U}_\alpha\}$ of $Z$ yields an open covering of $X$. This is because each $f^{-1}(\mathit{U}_\alpha)$ is open (by continuity of $f$), and every point in $X$ is mapped to some point in $Z$, hence contained in some $f^{-1}(\mathit{U}_\alpha)$. Each $f^{-1}(\mathit{U}_\alpha)$ might be made of multiple path-connected components, so the open covering of $X$ consists of all the path-connected components of all the $f^{-1}(\mathit{U}_\alpha)$ (a small sketch of the nerve construction follows the list below). Two important properties of this construction:
- The dimensionality of $Z$ determines the dimensionality of the simplicial complex that can be recovered. So if $Z$ is $\mathbb{R}$, then the highest dimension of simplices is 1, i.e. edges, giving a graph like the one in the figure above. The resulting graph is also a generalization of the Reeb graph, which tracks changes in the topology of the level sets of $f$.
- The construction can reveal structure at multiple resolutions: if there are two coverings $\{\mathit{U}_\alpha\}$ and $\{\mathit{V}_\beta\}$ of $Z$ at different resolutions such that any $\mathit{U}_\alpha$ is contained in some $\mathit{V}_\beta$, then the resulting pre-image coverings of $X$ are also at different resolutions, because $f^{-1}(\mathit{U}_\alpha) \subseteq f^{-1}(\mathit{V}_\beta)$.
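To make the nerve construction concrete, here is a minimal Python sketch. The function name `nerve` and the representation of a cover as a list of sets of point labels are my own choices for illustration, not part of [1].

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Nerve of a finite cover.

    `cover` is a list of sets; the i-th entry contains the labels of the
    points lying in the i-th cover set.  A collection of cover sets spans a
    simplex whenever their common intersection is non-empty.
    """
    simplices = []
    for k in range(1, max_dim + 2):              # simplices with k vertices
        for idx in combinations(range(len(cover)), k):
            common = set.intersection(*(cover[i] for i in idx))
            if common:
                simplices.append(idx)
    return simplices

# Toy cover of the points 0..5 by three overlapping sets.  The triple
# intersection is empty, so the nerve is three vertices and three edges
# (a hollow triangle).
cover = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
print(nerve(cover))
```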
The statistical version
Now, given a point cloud $\{x_i\}$ from $X$, to mimic the above topological construction, Mapper first needs a continuous function $f$. The choice of this function dictates what kind of structure can be recovered, so it should reflect the intended purpose of the study. For example, if we are interested in the intrinsic geometric structure of $X$, then we might take $f$ to be some sort of distance to a reference point; or if we are interested in the probabilistic structure over $X$ that generates the point cloud, then we might take $f$ to be some kernel density estimate; finally, if what we have is a regression dataset, so that for each $x_i$ there is a corresponding $y_i$, and we know that the underlying regression function is continuous, then we can take $f$ to be that regression function and do not need to choose some other $f$.
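As a rough illustration of two of these choices, here is a small sketch. The names `eccentricity_filter` and `kde_filter` are mine, and the density estimate is a plain Gaussian KDE, not necessarily the one used in [1].

```python
import numpy as np

def eccentricity_filter(X, reference=None):
    """Distance of each point to a reference point (default: the centroid)."""
    ref = X.mean(axis=0) if reference is None else reference
    return np.linalg.norm(X - ref, axis=1)

def kde_filter(X, bandwidth=0.5):
    """A simple Gaussian kernel density estimate evaluated at each data point."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2)).mean(axis=1)

# Example point cloud: a noisy circle; both filters give a scalar value per point.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((300, 2))
f = eccentricity_filter(X)        # or kde_filter(X)
```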
Having chosen $f$, to build the covering of $X$ by pre-images, we first cover the range of $f$ using overlapping open sets. If the target space of $f$ is Euclidean, then we can just use overlapping intervals/squares/cubes/hyper-cubes as bins. In the topological construction, we would now take the connected components of the pre-images of these bins, but with a point cloud, all we have are the subsets of points that are mapped into the same bin. The main idea of the statistical version is to use a clustering algorithm to group the points in a pre-image that are likely to be in the same connected component. Mapper uses single-linkage clustering, which does not require predetermining the number of clusters. After clustering, Mapper omits all the structure inside each estimated connected component, assumes that it is homotopy equivalent to a point (contractible), and represents it as a single node in the resulting graph (as in the figure above). Finally, if two connected components from the pre-images of different bins share at least one point (which can happen because the bins overlap), then their corresponding nodes are linked by an edge.
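Putting these steps together for a scalar-valued $f$, a bare-bones sketch might look like the following. The names `mapper_graph`, `n_bins`, `overlap` and `eps` are mine, and I simply cut the single-linkage dendrogram at a fixed length `eps`, which is a simplification of the edge-length-histogram heuristic described in [1].

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def mapper_graph(X, f, n_bins=10, overlap=0.3, eps=0.2):
    """A bare-bones Mapper for a scalar filter f: overlapping interval bins over
    the range of f, single-linkage clustering inside each pre-image, and an
    edge between two clusters whenever they share a data point."""
    lo, hi = f.min(), f.max()
    width = (hi - lo) / n_bins
    nodes, node_members = [], []
    for b in range(n_bins):
        # Overlapping interval: stretch each bin by `overlap` on both sides.
        a = lo + b * width - overlap * width
        c = lo + (b + 1) * width + overlap * width
        idx = np.where((f >= a) & (f <= c))[0]
        if len(idx) == 0:
            continue
        if len(idx) == 1:
            labels = np.array([1])
        else:
            # Cut the single-linkage dendrogram at length eps (a simplification
            # of the heuristic in [1]).
            labels = fcluster(linkage(X[idx], method="single"),
                              t=eps, criterion="distance")
        for lab in np.unique(labels):
            nodes.append((b, lab))
            node_members.append(set(idx[labels == lab]))
    # Two nodes are connected if their point sets intersect (bins overlap).
    edges = [(i, j)
             for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if node_members[i] & node_members[j]]
    return nodes, node_members, edges
```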
Furthermore, by controlling the size of the bins and how much they overlap, we control the resolution of the coverings, as previously discussed for the topological construction. This allows recovering structure at multiple resolutions from the point cloud.
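With the sketch above (reusing `X`, `f` and `mapper_graph` from the earlier snippets), changing the resolution just means changing the bin parameters, e.g.:

```python
# Coarse and fine coverings of the same point cloud.
coarse = mapper_graph(X, f, n_bins=5,  overlap=0.3, eps=0.2)
fine   = mapper_graph(X, f, n_bins=20, overlap=0.3, eps=0.2)
print(len(coarse[0]), "nodes at the coarse resolution,",
      len(fine[0]), "nodes at the fine resolution")
```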
The Mapper procedure can also work with a higher-dimensional codomain for $f$, which would result in higher-dimensional simplicial complexes, but these are harder to visualize. With a scalar-valued function, typical resulting graphs look like the figure above: the average value of $f$ over each component (assuming the codomain is $\mathbb{R}$) is shown as the colour of the node, and the number of points in the component is reflected by the size of the node and the number on it.
One thing we have omitted from the discussion is the affinity measure used by the clustering algorithm to partition pre-images into estimated connected components. Different affinity measures reflect different notions of "neighbourhood", so they would result in different partitionings of the pre-images. Fortunately, if the affinity measures are similar, Mapper is robust with respect to this choice. In particular, if the affinity measure is induced by a norm and $X$ is finite-dimensional, then, by the equivalence of norm-induced topologies on finite-dimensional normed spaces, we should recover the same structure provided the point cloud is sampled densely enough or the covering resolution is coarse enough. I have done some preliminary experiments suggesting that Mapper indeed has this property. I'll blog about it once I have some more comprehensive results.
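A rough sketch of the kind of comparison such an experiment involves, reusing the point cloud `X` from the earlier snippet (this is only an illustration, not the experiments mentioned above): cluster the same pre-image under two different metrics and compare the resulting partitions with scikit-learn's adjusted Rand index.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

def single_linkage_labels(points, metric, eps):
    """Single-linkage cluster labels under a chosen affinity (distance) metric."""
    return fcluster(linkage(pdist(points, metric=metric), method="single"),
                    t=eps, criterion="distance")

# Partition the same points under two equivalent metrics; an adjusted Rand
# index of 1.0 means identical partitions.  The cut length may need rescaling
# across metrics, so this is only a quick check.
labels_l2 = single_linkage_labels(X, metric="euclidean", eps=0.2)
labels_l1 = single_linkage_labels(X, metric="cityblock", eps=0.3)
print(adjusted_rand_score(labels_l2, labels_l1))
```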
References:
[1] Singh, Gurjeet, Facundo Mémoli, and Gunnar E. Carlsson. "Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition." In SPBG, pp. 91-100. 2007.
[2] Carlsson, Gunnar. "Topology and data." Technical report, 2008.
[3] Lum, P. Y., G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, and G. Carlsson. "Extracting insights from the shape of complex data using topology." Scientific Reports 3 (2013).