Density-based Topology Learning from Data

XIMENA FERNANDEZ

Statistics and Data Science Seminar

Queen Mary University - 7 October 2024

Topology vs Geometry

Geometry = fine details + quantitative answers

Topology = fundamental properties + qualitative answers

Topological inference

Let $X$ be a space and let $\mathbb{X}_n = \{x_1,...,x_n\}$ be a finite sample of $X$.

Q: How to infer topological properties of $X$ from $\mathbb{X}_n$?

Point cloud
$\mathbb{X}_n \subset \mathbb{R}^D$

For $\epsilon>0$, the $\epsilon$-thickening of $\mathbb{X}_n$: \[\displaystyle U_\epsilon = \bigcup_{x\in \mathbb{X}_n}B_{\epsilon}(x)\]
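As a minimal sketch of the definition above, the following checks whether a query point lies in $U_\epsilon$, assuming open Euclidean balls. The function name `in_thickening` and the circle sample are illustrative choices, not part of the talk.

```python
import numpy as np

def in_thickening(y, sample, eps):
    """Return True if y lies in U_eps, the union of open eps-balls around the sample."""
    dists = np.linalg.norm(sample - y, axis=1)
    return bool(dists.min() < eps)

# Hypothetical sample: 50 equally spaced points on the unit circle in R^2.
angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
sample = np.column_stack([np.cos(angles), np.sin(angles)])

print(in_thickening(np.array([1.05, 0.0]), sample, eps=0.1))  # point close to the circle
print(in_thickening(np.array([0.0, 0.0]), sample, eps=0.1))   # centre of the circle
```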

Homology inference

Theorem (Niyogi, Smale & Weinberger, 2008). Let $\mathcal{M}$ be a compact submanifold of $\mathbb{R}^D$ of dimension $k$ with reach $\tau_\mathcal{M}$, and let $\mathbb{X}_n$ be a set of $n$ i.i.d. points drawn according to the uniform probability measure on $\mathcal{M}$. Then $$ U_\epsilon \simeq \mathcal{M}$$ with probability $>1-\delta$ if $0<\epsilon< \frac{\tau_\mathcal{M}}{2}$ and $n> \beta_1 \left(\log(\beta_2)+\log\left(\frac{1}{\delta}\right)\right)$.*

*Here $\beta_1=\frac{\mathrm{vol}(\mathcal{M})}{\cos^k(\theta_1) \mathrm{vol}(B^k_{\epsilon/4})}$, $\beta_2=\frac{\mathrm{vol}(\mathcal{M})}{\cos^k(\theta_2)\mathrm{vol}(B^k_{\epsilon/8})}$ and $\theta_1=\arcsin\left(\frac{\epsilon}{8\tau_\mathcal M}\right),$ $\theta_2=\arcsin\left(\frac{\epsilon}{16\tau_\mathcal M}\right)$.
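To get a feeling for the size of the bound, the sketch below evaluates $\beta_1\left(\log(\beta_2)+\log(1/\delta)\right)$ numerically. It assumes $B^k_r$ is the Euclidean $k$-ball of radius $r$; the function names and the unit-circle example ($k=1$, volume $2\pi$, reach $1$) are illustrative assumptions.

```python
import math

def ball_volume(k, r):
    """Volume of the k-dimensional Euclidean ball of radius r."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

def nsw_sample_bound(vol_m, k, tau, eps, delta):
    """Evaluate beta_1 * (log(beta_2) + log(1/delta)) from the theorem."""
    if not 0 < eps < tau / 2:
        raise ValueError("the theorem requires 0 < eps < tau / 2")
    theta1 = math.asin(eps / (8 * tau))
    theta2 = math.asin(eps / (16 * tau))
    beta1 = vol_m / (math.cos(theta1) ** k * ball_volume(k, eps / 4))
    beta2 = vol_m / (math.cos(theta2) ** k * ball_volume(k, eps / 8))
    return beta1 * (math.log(beta2) + math.log(1 / delta))

# Hypothetical example: the unit circle (k = 1, length 2*pi, reach 1).
bound = nsw_sample_bound(vol_m=2 * math.pi, k=1, tau=1.0, eps=0.4, delta=0.05)
print(math.ceil(bound))  # number of points the bound asks for
```

Since $\mathrm{vol}(B^k_{\epsilon/4})$ scales like $\epsilon^k$, the bound grows quickly as $\epsilon$ shrinks or as the intrinsic dimension $k$ increases.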

Topological inference

Evolving thickenings

Filtration of simplicial complexes

$V_\epsilon(\mathbb{X}_n) = \{\sigma\subseteq \mathbb{X}_n\colon \mathrm{diam}(\sigma)<\epsilon\}$

[Figure: $V_\epsilon(\mathbb{X}_n)$ for $\epsilon = 0$, $\epsilon = 0.7$, $\epsilon = 1.2$ and $\epsilon = 1.7$]
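The complex $V_\epsilon(\mathbb{X}_n)$ can be built by brute force for small samples. The sketch below is one such construction, assuming $\epsilon>0$ and Euclidean distances; the name `rips_complex` and the unit-square point cloud are illustrative, and the enumeration is exponential in the simplex dimension, so it is only meant for tiny examples.

```python
from itertools import combinations
import numpy as np

def rips_complex(points, eps, max_dim=2):
    """Simplices (as index tuples) of V_eps up to dimension max_dim, for eps > 0.

    A simplex is included when its diameter, the largest pairwise
    Euclidean distance among its vertices, is strictly less than eps.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = [(i,) for i in range(n)]  # vertices have diameter 0 < eps
    for dim in range(1, max_dim + 1):
        for sigma in combinations(range(n), dim + 1):
            if all(dist[i, j] < eps for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

# Hypothetical point cloud: the four corners of a unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
for eps in (0.5, 1.2, 1.7):
    cx = rips_complex(square, eps)
    counts = [sum(len(s) == d + 1 for s in cx) for d in range(3)]
    print(eps, counts)  # numbers of vertices, edges, triangles
```

As $\epsilon$ grows the square first appears as a cycle of four edges and is then filled in once the diagonals enter the complex.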

Persistent homology

Metric space $~~~~\longrightarrow~~~~~$ Persistence diagram
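For degree 0 the map from a finite metric space to its persistence diagram can be computed with a union-find structure: every point is born at scale 0 and a class dies when two connected components of the Rips filtration merge. The sketch below is a minimal illustration under that standard fact; the name `h0_persistence` and the toy data are assumptions, and higher degrees require a dedicated library.

```python
import numpy as np

def h0_persistence(dist):
    """Finite death times of the degree-0 classes for a distance matrix.

    Edges are processed in increasing length (Kruskal's algorithm);
    each merge of two components kills one class.
    """
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(float(w))
    return deaths  # n - 1 bars (0, death); one class never dies

# Hypothetical metric space: two well-separated clusters on the real line.
pts = np.array([0.0, 0.1, 5.0, 5.2])
dist = np.abs(pts[:, None] - pts[None, :])
deaths = h0_persistence(dist)
print(deaths)
```

The single long bar reflects the two clusters, while the short bars come from points inside each cluster.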

Persistent homology

Metric space: $(\mathbb X_n, d_E)\sim (\mathcal M, d_E)$

$\bullet ~ ~\mathrm{Rips}_\epsilon(\mathcal{M}, d_E)\simeq \mathcal{M}$ for $\epsilon < 2 \sqrt{\frac{D+1}{2D}}\mathrm{rch}(\mathcal{M})~~$ (Kim, Shin, Chazal, Rinaldo & Wasserman, 2020)

Persistent homology

Metric space: $(\mathbb X_n, d_{kNN})\sim (\mathcal M, d_\mathcal{M})~~~$
(Bernstein, De Silva, Langford & Tenenbaum, 2000)

$\bullet ~ ~\mathrm{Rips}_\epsilon(\mathcal{M}, d_\mathcal{M})\simeq \mathcal{M}$ for $\epsilon < \mathrm{conv}(\mathcal{M}, d_{\mathcal{M}})~~$ (Hausmann, 1995; Latschev, 2001)
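A possible way to compute $d_{kNN}$ is to take shortest paths in the $k$-nearest-neighbour graph with Euclidean edge lengths. The sketch below assumes distinct sample points and uses SciPy's `shortest_path`; the function name `knn_geodesic_distance` and the circle example are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def knn_geodesic_distance(points, k):
    """Shortest-path distances in the k-NN graph with Euclidean edge lengths."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # k nearest neighbours of each point; column 0 of the argsort is the
    # point itself (assuming no repeated points), so it is skipped.
    neighbours = np.argsort(dist, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    cols = neighbours.ravel()
    graph = csr_matrix((dist[rows, cols], (rows, cols)), shape=(n, n))
    # directed=False lets paths use an edge in either direction.
    return shortest_path(graph, method="D", directed=False)

# Hypothetical sample: 100 equally spaced points on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
geo = knn_geodesic_distance(circle, k=2)
print(geo[0, 50], np.linalg.norm(circle[0] - circle[50]))
```

For antipodal points the graph distance is close to the arc length $\pi$, whereas the Euclidean distance is $2$.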

Density-based geometry

Let $\mathbb{X}_n = \{x_1,...,x_n\}\subseteq \mathbb{R}^D$ be a finite sample.

Assume that:

$\mathbb{X}_n$ is a sample of a compact manifold $\mathcal M$ of dimension $d$.
The points are sampled according to a density $f\colon \mathcal M\to \mathbb R$.

Density-based geometry

Fermat principle

The path taken by a ray between two given points is the path that can be traversed in the least time.

That is, it is an extremal of the functional \[ \gamma\mapsto \int_{0}^1\eta(\gamma_t)||\dot{\gamma}_t|| dt \] where $\eta$ is the refractive index.

Density-based geometry

(Hwang, Damelin & Hero, 2016)

Let $\mathcal M \subseteq \mathbb{R}^D$ be a manifold and let $f\colon\mathcal{M}\to \mathbb{R}_{>0}$ be a smooth density.

For $q>0$, the deformed Riemannian distance* in $\mathcal{M}$ is \[d_{\mathcal M, f,q}(x,y) = \inf_{\gamma} \int_{I}\frac{1}{f(\gamma_t)^{q}}||\dot{\gamma}_t|| dt \] where the infimum is taken over all curves $\gamma\colon I=[0,1]\to \mathcal{M}$ with $\gamma_0 = x$ and $\gamma_1=y$.

* Here, if $g$ is the inherited Riemannian metric tensor, then $d_{\mathcal M, f,q}$ is the Riemannian distance induced by $g_q= f^{-2q} g$.

Fermat distance

(McKenzie & Damelin, 2019) (Groisman, Jonckheere & Sapienza, 2022)

Let $\mathbb{X}_n = \{x_1,...,x_n\}\subseteq \mathbb{R}^D$ be a finite sample.

For $p> 1$, the Fermat distance between $x,y\in \mathbb{R}^D$ is defined by \[ d_{\mathbb{X}_n, p}(x,y) = \inf_{\gamma} \sum_{i=0}^{r}|x_{i+1}-x_i|^{p} \] over all paths $\gamma=(x_0, \dots, x_{r+1})$ of finite length with $x_0=x$, $x_{r+1} = y$ and $\{x_1, x_2, \dots, x_{r}\}\subseteq \mathbb{X}_n$.
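Between sample points, the infimum above is a shortest-path problem on the complete graph with edge weights $|x_i-x_j|^p$. The following sketch solves it with a vectorised Floyd-Warshall recursion; the function name is an illustrative choice and this is not the interface of any particular library.

```python
import numpy as np

def fermat_distance_matrix(points, p):
    """All-pairs sample Fermat distances d_{X_n, p} between the sample points."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d = dist ** p  # cost of a single jump between two sample points
    for m in range(len(points)):
        # Allow paths that pass through the sample point m (O(n^3) overall).
        d = np.minimum(d, d[:, m, None] + d[None, m, :])
    return d

# Hypothetical sample: three collinear points 0, 1, 2 on the real line.
line = np.array([[0.0], [1.0], [2.0]])
print(fermat_distance_matrix(line, p=2.0))
```

With $p=2$ the direct jump from $0$ to $2$ costs $4$, while the path through the middle point costs $1+1=2$: for $p>1$ many short steps are cheaper than one long step, so optimal paths favour densely sampled regions.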

Convergence results

After rescaling by a constant $C(n,p,d)$, $d_{\mathbb X_n, p}$ is an estimator of $d_{\mathcal M, f,q}$ when $q=(p-1)/d$.

\[d_{\mathbb{X}_n, p}(x,y) = \inf_{\gamma} \sum_{i=0}^{r}|x_{i+1}-x_i|^{p}\]

\[d_{\mathcal M, f,q}(x,y) = \inf_{\gamma} \int_{I}\frac{1}{f(\gamma_t)^{q}}||\dot{\gamma}_t|| dt \]

Convergence results

(F., Borghini, Mindlin & Groisman, 2023)

Let $\mathcal{M}$ be a closed smooth $d$-dimensional manifold embedded in $\mathbb{R}^D$.

\[\big(\mathbb{X}_n, C(n,p,d) d_{\mathbb{X}_n,p}\big)\xrightarrow[n\to \infty]{GH}\big(\mathcal{M}, d_{\mathcal{M},f,q}\big) ~~~ \text{ for } q = (p-1)/d\]

Recall that $$d_{H}\big((X, d),(Y,d)\big) = \max \big\{\sup_{x\in X}d(x,Y), \sup_{y\in Y}d(X,y)\big\}, ~~\text{for }X,Y\subseteq (Z,d)$$ $$d_{GH}\big((X, d_X),(Y,d_Y)\big)= \inf_{\substack{Z \text{ metric space}\\ f:X\to Z,\ g:Y\to Z \text{ isometric embeddings}}}d_H(f(X), g(Y))$$
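For finite subsets of $\mathbb{R}^D$ the Hausdorff distance recalled above can be evaluated directly from the matrix of cross distances. The sketch below assumes the Euclidean metric on the ambient space; the function name and the toy sets are illustrative. The Gromov-Hausdorff distance involves an infimum over all isometric embeddings and is not attempted here.

```python
import numpy as np

def hausdorff_distance(X, Y):
    """Hausdorff distance between two finite point sets in the same Euclidean space."""
    cross = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    sup_x = cross.min(axis=1).max()  # sup over x of d(x, Y)
    sup_y = cross.min(axis=0).max()  # sup over y of d(X, y)
    return float(max(sup_x, sup_y))

# Hypothetical sets: Y contains X plus one extra point at distance 2 from X.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
print(hausdorff_distance(X, Y))
```

The example shows why both suprema are needed: every point of $X$ is in $Y$, yet the extra point of $Y$ is far from $X$.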

Theorem (F., Borghini, Mindlin, Groisman, 2023)

Let $\mathbb{X}_n$ be a sample of a closed manifold $\mathcal M$ of dimension $d$, drawn according to a density $f\colon \mathcal M\to \mathbb R$.
Given $p>1$ and $q=(p-1)/d$, there exists a constant $\mu = \mu(p,d)$ such that for every $\lambda \in \big(\frac{p-1}{pd}, \frac{1}{d}\big)$ and $\varepsilon>0$ there exists $\theta>0$ satisfying \[ \mathbb{P}\left( d_{GH}\left(\big(\mathcal{M}, d_{\mathcal{M},f,q}\big), \big(\mathbb{X}_n, {\scriptstyle \frac{n^{q}}{\mu}} d_{\mathbb{X}_n, p}\big)\right) > \varepsilon \right) \leq \exp{\left(-\theta n^{(1 - \lambda d) /(d+2p)}\right)} \] for $n$ large enough.

Fermat-distance

Computational implementation

Complexity: $O(n^3)$, reducible to $O(n^2 k\log n)$ using the $k$-NN graph (for $k = O(\log n)$ the geodesics belong to the $k$-NN graph with high probability).
Python library: fermat
Computational experiments: ximenafernandez/intrinsicPH
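The $k$-NN reduction can be sketched with SciPy alone: restrict the edges to the $k$-nearest-neighbour graph, weight them by $|x_i-x_j|^p$ and run Dijkstra from every source. This is an illustrative sketch assuming distinct sample points, not the interface of the fermat library; the dense distance matrix is used only to keep the example short.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def fermat_distance_knn(points, p, k):
    """Sample Fermat distances with paths restricted to the k-NN graph."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Column 0 of the argsort is the point itself (no repeated points assumed).
    neighbours = np.argsort(dist, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    cols = neighbours.ravel()
    weights = dist[rows, cols] ** p
    graph = csr_matrix((weights, (rows, cols)), shape=(n, n))
    # One Dijkstra run per source on a graph with n * k edges.
    return dijkstra(graph, directed=False)

# Hypothetical sample: four collinear points 0, 1, 2, 3 on the real line.
line = np.array([[0.0], [1.0], [2.0], [3.0]])
d_knn = fermat_distance_knn(line, p=2.0, k=2)
print(d_knn[0, 3])
```

In this example the optimal path from $0$ to $3$ uses three unit steps, all of which are edges of the $2$-NN graph, so the restricted computation agrees with the unrestricted one.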

Persistent homology

Convergence of persistence diagrams

\[\big(\mathbb{X}_n, C(n,p,d) d_{\mathbb{X}_n,p}\big)\xrightarrow[n\to \infty]{GH}\big(\mathcal{M}, d_{\mathcal{M},f,q}\big) ~~~ \text{ for } q = (p-1)/d\]

Stability \[d_B\Big( \mathrm{dgm}\big(\mathrm{Filt}(X, d_X)\big), \mathrm{dgm}\big(\mathrm{Filt}(Y, d_Y)\big)\Big)\leq 2 d_{GH}\big((X,d_X),(Y,d_Y)\big)\]

Recall that $$d_{B}\big(\mathrm{Dgm}_1, \mathrm{Dgm}_2\big) = \inf_{\alpha \in \Lambda}\sup_{z\in \mathrm{Dgm}_1\cup \mathrm{Diag}}||z-\alpha(z)||_\infty$$ where $\mathrm{Diag}$ is the diagonal (counted with infinite multiplicity) and $\Lambda$ is the set of all bijections between $\mathrm{Dgm}_1 \cup \mathrm{Diag}$ and $\mathrm{Dgm}_2 \cup \mathrm{Diag}$.

$\Downarrow$

\[\mathrm{dgm}(\mathrm{Filt}(\mathbb{X}_n, {C(n,p,d)} d_{\mathbb{X}_n,p}))\xrightarrow[n\to \infty]{B}\mathrm{dgm}(\mathrm{Filt}(\mathcal{M}, d_{\mathcal{M},f,q})) ~~~ \text{ for } q = (p-1)/d\]

Fermat-based persistence diagrams

Intrinsic reconstruction

$\bullet ~ ~\mathrm{Rips}_\epsilon(\mathcal{M}, d_{\mathcal{M},f,q})\simeq \mathcal{M}$ for $\epsilon < \mathrm{conv}(\mathcal{M}, d_{\mathcal{M},f,q})$

Fermat-based persistence diagrams

Robustness to outliers

Proposition (F., Borghini, Mindlin, Groisman, 2023)

Let $\mathbb{X}_n$ be a sample of $\mathcal{M}$ and let $Y\subseteq \mathbb{R}^D\smallsetminus \mathcal{M}$ be a finite set of outliers. Let $\delta = \displaystyle \min\Big\{\min_{y\in Y} d_E(y, Y\smallsetminus \{y\}), ~d_E(\mathbb X_n, Y)\Big\}$.
Then, for all $k>0$ and $p>1$, \[ \mathrm{dgm}_k(\mathrm{Rips}_{<\delta^p}(\mathbb{X}_n \cup Y, d_{\mathbb{X}_n\cup Y, p})) = \mathrm{dgm}_k(\mathrm{Rips}_{<\delta^p}(\mathbb{X}_n, d_{\mathbb{X}_n, p})) \] where $\mathrm{Rips}_{<\delta^p}$ denotes the Rips filtration up to parameter $\delta^{p}$ and $\mathrm{dgm}_k$ the persistence diagram of the degree-$k$ persistent homology.
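The mechanism behind the proposition can be checked numerically at the level of distances: any path through an outlier contains a step of length at least $\delta$, so it costs at least $\delta^p$. The sketch below verifies, on a hypothetical circle sample with two hand-placed outliers, that below $\delta^p$ the distances among sample points are unchanged and that the outliers are isolated vertices. It does not compute persistence diagrams, and all names and data are illustrative.

```python
import numpy as np

def fermat_matrix(points, p):
    """All-pairs sample Fermat distances via Floyd-Warshall."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d = dist ** p
    for m in range(len(points)):
        d = np.minimum(d, d[:, m, None] + d[None, m, :])
    return d

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.column_stack([np.cos(angles), np.sin(angles)])  # sample of the unit circle
Y = np.array([[0.0, 0.0], [3.0, 3.0]])                 # two hypothetical outliers
p = 3.0
n = len(X)

# delta as in the proposition: outliers are delta-far from each other and from X.
delta = min(np.linalg.norm(Y[0] - Y[1]),
            np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1).min())
threshold = delta ** p

d_clean = fermat_matrix(X, p)
d_noisy = fermat_matrix(np.vstack([X, Y]), p)

# Below delta^p, distances among sample points are unchanged by the outliers.
mask = d_clean < threshold
same_on_sample = bool(np.allclose(d_noisy[:n, :n][mask], d_clean[mask]))

# Each outlier is at Fermat distance >= delta^p from every other point
# (a small relative tolerance absorbs floating-point rounding).
outlier_rows = d_noisy[n:, :].copy()
outlier_rows[np.arange(len(Y)), n + np.arange(len(Y))] = np.inf  # ignore d(y, y) = 0
outliers_isolated = bool((outlier_rows >= threshold * (1 - 1e-9)).all())

print(same_on_sample, outliers_isolated)
```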

Application to time series analysis

Topological analysis of time series

Signal: $\varphi:\mathbb R \to \mathbb{R}$

We assume that $\varphi$ is an observation of an underlying dynamical system $(\mathcal M, \phi)$, with $\mathcal M$ a topological space and $\phi\colon \mathbb R \times \mathcal M\to \mathcal M$ the evolution function.

That is, there exist an observation function $F:\mathcal M\to \mathbb R$ and an initial state $x_0\in \mathcal M$ such that \begin{align}\varphi = \varphi_{x_0}:\mathbb{R}&\to \mathbb{R}\\ t&\mapsto F(\phi_t(x_0)) \end{align}

Topological analysis of time series

Signal: $\varphi:\mathbb R \to \mathbb{R}$
Delay embedding: Given a time delay $T$ and an embedding dimension $D$, \[\mathcal{M}_{T,D} (\varphi) = \{\big(\varphi(t), \varphi(t+T), \varphi(t+2 T), \dots, \varphi(t+(D-1)T)\big): t\in \mathbb R\}\subseteq \mathbb{R}^D\]
Limit set: Given $(\mathcal M, \phi)$ a dynamical system and $x_0\in \mathcal M$, \[\mathcal A_{x_0} = \{x\in \mathcal M: \exists t_i\to \infty \text { with } \phi_{t_i}(x_0)\to x\}.\]
Theorem (Takens).* Let $\mathcal{M}$ be a smooth, compact, Riemannian manifold. Let $T> 0$ be a real number and let $D > 2 \mathrm{dim}(\mathcal{M})$ be an integer. Then, for generic $\phi \in C^2(\mathbb{R} \times \mathcal{M}, \mathcal{M})$, $F\in C^2(\mathcal{M}, \mathbb{R})$ and $x_0\in \mathcal M$, if $\varphi_{x_0} = F(\phi_\bullet(x_0))$ is an observation of $(\mathcal M, \phi)$, then the limit set $\mathcal A_{x_0}$ is 'diffeomorphic'$^{**}$ to $\mathcal{M}_{T,D} (\varphi_{x_0})$.

** There exists an embedding $\psi:\mathcal M\to \mathbb R^{D}$ such that $\psi|_{\mathcal A_{x_0}}: \mathcal A_{x_0}\to \mathcal{M}_{T,D} (\varphi_{x_0})$ is a bijection.
* Corollary 5, Detecting strange attractors in turbulence, F. Takens, 1981.
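For a uniformly sampled signal, the delay embedding defined above can be sketched as follows. The delay is expressed in number of samples rather than in time units, which is an assumption of this sketch, as are the function name and the cosine test signal.

```python
import numpy as np

def delay_embedding(signal, delay, dim):
    """Rows are (s[i], s[i + delay], ..., s[i + (dim - 1) * delay])."""
    n_vectors = len(signal) - (dim - 1) * delay
    if n_vectors <= 0:
        raise ValueError("signal too short for this delay and dimension")
    columns = [signal[j * delay: j * delay + n_vectors] for j in range(dim)]
    return np.column_stack(columns)

# Hypothetical signal: cos(t) sampled with step pi/100 over four periods.
t = np.linspace(0.0, 8.0 * np.pi, 800, endpoint=False)
signal = np.cos(t)

# A delay of 50 samples corresponds to T = pi/2, so each row is
# (cos t, cos(t + pi/2)) = (cos t, -sin t), a point on the unit circle.
cloud = delay_embedding(signal, delay=50, dim=2)
print(cloud.shape)
```

The resulting point cloud traces a circle, matching the periodic dynamics of the signal; its persistence diagram would then show one prominent class in degree 1.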