Topological Data Analysis

Gregory Clark

July 30, 2014

Topology is "Pure" Math


Topology has traditionally been considered pure math.

http://xkcd.com/435/

Our goal is to understand the shape of high-dimensional data better using topological tools.

Homotopy Groups


Homotopic loops from Hatcher's Algebraic Topology

The corner stone algebraic topology, \(\pi_1\), is the group of equivalence classes of loops where two loops are equivalent if one can be continuously deformed into the other.


More generally, other homotopy groups are defined on equivalence classes of higher dimensional “loops” (maps from higher dimensional spheres).


These groups are topological invariants, and computing them would help us identify the shape of spaces. However, . . .

Intractible Computations


However, homotopy groups are notoriously difficult to compute (due to generator and relation computations in free groups). [Carlsson] [Hatcher]

Simplicial Homology


Homology groups are easier to compute.*

$$H_n = \frac{Z_n}{B_n} = \frac{\text{ker}(\delta_n)}{\text{im}(\delta_{n+1})}$$
Homology from Edelsbrunner and Harer

*Technical conditions apply.

Betti Numbers


The \(k^{\text{th}}\) Betti number is the number of generators of the \(k^{\text{th}}\) homology group.


$$\beta_k = \text{rank}(H_k) = \text{log}_2 \frac{|\text{ker}(\delta_k)|}{|\text{im}(\delta_{k+1})|}$$

We may calculate Betti numbers using the Smith normal forms of matrix represenations of the boundary operators \(\delta_i\).


The \(k^{\text{th}}\) Betti number counts equivalence classes of \(k\)-dimensional surfaces (the number of “\(k\)-dimensional holes”) in the space.

Wedge of Circles


For example, let \(X = S^1 \vee S^1\) be the wedge of two circles.

Wedge of two circles

What are the Betti numbers of \(X\)?


\(\beta_0 = \) the number of connected components \(= 1\)
\(\beta_1 =\) the number of one-dimensional holes \(= 2\)
\(\beta_k =\)\( 0 \quad \text{ for all } k > 1\)

Betti Numbers of a Dataset


Suppose we have a dataset containing \(n\) points in \(\mathbb{R}^d\).

What are the Betti numbers of that set?


\(\beta_0 = \) the number of connected components \(= n\)
\(\beta_1 =\) the number of one-dimensional holes \(= 0\)
\(\beta_k =\)\( 0 \quad \text{ for all } k > 0\)



Okay, that’s fine... but not very interesting.

Simplicial Complexes


Simplicial complexes are generalizations of graphs.

Graphs are simplicial complexes.

Instead of edges that connect two vertices, simplicial complexes have simplices which connect any number of vertices.

A simplicial complex

A Formal Definition


A simplicial complex is a finite collection of sets \(A\) such that \(\alpha \in A\) and \(\beta \subset \alpha\) implies \(\beta \in A\).
The sets in \(A\) of cardinality \(k+1\) are called \(k\)-simplices.
Simplices

We often blur the distinction between an abstract simplicial complex and its geometric realization in Euclidean space.

Čech Complex


The Čech complex is the nerve of the balls of radius \(r\) around each point.

$$\text{Čech}(S, r) = \big\{\sigma \subset S : \bigcap_{x \in \sigma} B_x(r) \neq \varnothing \big\}$$

Vietoris-Rips Complex


The Vietoris-Rips complex contains all edges between points that are within \(2r\) along with every higher-dimensional simplex whose edges are all included.

$$\text{Vietoris-Rips}(S, r) = \big\{\sigma \subset S : \text{diam } \sigma \le 2r \big\}$$

Vietoris-Rips Lemma



Let \(S\) be a finite set of points in \(\mathbb{R}^d\), and let \(r > 0\). Then $$\text{Čech}(S, r) \subset \text{Vietoris-Rips}(S, r) \subset \text{Čech}(S, \sqrt{2}r).$$

We tend to favor Vietoris-Rips complexes since they are easier to compute.

Caveat


Both Čech and Vietoris-Rips complexes could produce a simplicial complex of higher dimension than the space we started in!

Delaunay Complex


$$\text{Delaunay}(S) = \big\{\sigma \subset S : \bigcap_{u \in \sigma} V_u \neq \varnothing \big\}$$ where \(V_u\) is the Voronoi cell of \(u\).

Interactive Voronoi Diagram Generator

Alpha Complex


$$\text{Alpha}(S) = \big\{\sigma \subset S : \bigcap_{u \in \sigma} R_u(r) \neq \varnothing \big\} \\ \quad \quad \quad \quad \; \, = \text{Čech}(S, r) \cap \text{Delaunay}(S)$$
where \(R_u(r) = B_u(r) \cap V_u\).

(Lazy) Witness Complex


In practice using all of the points in a dataset can be wasteful in terms of computing time and memory. The Witness construction strategically chooses landmark points and creates a simplicial complex that is (hopefully) still a faithful topological representation of the underlying space.

Lazy-Witness : Witness : : Vietoris-Rips : Čech

Another Caveat


Real world data is often messy, and chosing a particular radius \(r\) can yield a simplicial complex with artificial features.

Complex with artifacts
\(\beta_1 = 3\)
Complex with different artifacts
\(\beta_1 = 2\)

Persistent Homology


The features that persist for a large range of resolutions are more likely to represent genuine features of the underlying space.

Software for TDA


Plex: Simplicial Complexes in Matlab


Plex 2.5.1
Authors:Patrick Perry
Vin de Silva
Started:2000
Updated:2006 (deprecated)
Languagefilesblankcommentcode
C++852709384310206
Bourne Shell684313398280
C/C++ Header74197364995209
MATLAB6762921652219
m471581091110
make1710953206
SUM:25664211400827230

JPlex: Persistent Homology Library


JPlex (PLEX3)
Authors:Henry Adams
John Chakerian
Started:2006
Updated:2011 “phased out”
Languagefilesblankcommentcode
HTML623048128324667
Java57167468008969
Bourne Shell481610167759
Bourne Again Shell16988570
C++117530370
MATLAB9106398309
R2445170
Ant1200143
SUM:1415977963443020

JavaPlex: Persistent Homology Library


javaPlex 4.2.0
Authors:Andrew Tausz
Mikael Vejdemo-Johansson
Henry Adams
Started:2010
Updated:yesterday
Languagefilesblankcommentcode
HTML405158917715130927
Java25956951262023457
Bourne Shell681910167773
MATLAB190176914464656
Bourne Again Shell16988570
C++117530370
Ant13211223
R2445170
Arduino Sketch13263164
Visual Basic1290122
SUM:8712458123008168498

Phom: Persistent Homology in R


phom 1.0.3
Authors:Andrew Tausz
Started:2011
Updated:February 2014
Languagefilesblankcommentcode
C/C++ Header214251921675
C++14719176
R25945118
SUM:245312561969

Mapper


Mapper
Authors:Gurjeet Singh
Started:2007
Updated:2009
Languagefilesblankcommentcode
MATLAB656215227
HTML12138114
SUM:777253341

Python Mapper


Python Mapper 0.1.6
Authors:Daniel Müllner
Aravindakshan Babu
Started:2011
Updated:March 2014
Languagefilesblankcommentcode
Bourne Shell83907451925464
m471063919850
Python1917487659131
C++13291402705
C/C++ Header110484466
HTML2200296
DOS Batch1231166
make2245130
SUM:417218560548208

Dionysus: Persistent Homology Library


Dionysus
Authors:Dmitriy Morozov
Started:2006
Updated:December 2013
Languagefilesblankcommentcode
C/C++ Header5011626924842
C++1618767870
Python177348265
CMake102114125
make15010
SUM:9414488216112

CTL: Computational Topology Library


CTL
Authors:Ryan H. Lewis
Started:August 2013
Updated:July 2014
Languagefilesblankcommentcode
C/C++ Header5885630475895
C++2230911091814
CMake21113168393
SUM:101127843248102

CHomP: Computational Homology Project





People at Rutgers are also doing things. I'm not sure exactly what, but it does involve 40,000+ lines of C/C++ code related to persistent homology.


Phom Example 1


R code:

N <- 50
x1 <- rnorm(N, mean=0, sd=1)
y1 <- rnorm(N, mean=0, sd=1)
z1 <- rnorm(N, mean=0, sd=1)
x2 <- rnorm(N, mean=8, sd=1)
y2 <- rnorm(N, mean=8, sd=1)
z2 <- rnorm(N, mean=8, sd=1)
V1 <- cbind(x1, y1, z1)
V2 <- cbind(x2, y2, z2)
V <- as.matrix(rbind(V1, V2))
plot(V)
Plot in R^2

(Plotted with processing.js.)

R code:

library('phom')
#### Loading required package: Rcpp

max_f <- 9

dim <- 0
intervals <- pHom(V, dim, max_f, mode='vr', metric='euclidean')
plotBarcodeDiagram(intervals, dim, max_f, title = '')

dim <- 1
intervals <- pHom(V, dim, max_f, mode='vr', metric='euclidean')
plotBarcodeDiagram(intervals, dim, max_f, title = '')

dim <- 2
intervals <- pHom(V, dim, max_f, mode='vr', metric='euclidean')
plotBarcodeDiagram(intervals, dim, max_f, title = '')

Note that "vr" stands for the Vietoris-Rips complex.

Barcode for \(\beta_0\)


Barcode dimension 0

Barcode for \(\beta_1\)


Barcode dimension 1

Barcode for \(\beta_2\)


Barcode dimension 2

Mapper does partial clustering to collapse data along a filter function and produce a simplicial complex.

Mapper Complex


The “Mapper complex” is the nerve of all partial clusters taken on the preimages of a covering of the range of a filter function \(f : S \to Z\).


$$\text{Mapper}(S, f, \mathcal{A}, \text{Clust}) = \big\{\sigma \subset P : \bigcap_{s \in \sigma} s \neq \varnothing \big\}$$
where \(\mathcal{A}\) is a cover of the range of \(f : S \to Z\), \(\text{Clust} : \mathcal{P}(S) \to \mathcal{P}(\mathcal{P}(S))\) is a clustering function, and

$$P = \bigcup_{U \in \mathcal{A}} \text{Clust}\big(f^{-1}(U)\big).$$

A Circle


Matlab:

X = randn(300, 2);
X = X./(sqrt(sum(X.*X,2))*ones(1, 2));
d = L2_distance(X',X',1);
filter = d(1, :);
scatter(X(:,1),X(:,2),1000,filter,'.');
axis equal;
Mapper results

Mapper Results


Matlab:

filterSamples = 5;
overlapPct = 40;
[adja, nodeInfo, levelIdx] = mapper(d,...
 filter, 1/filterSamples, overlapPct);
%%%% 
for i=1:length(nodeInfo)
    ecc(i) = nodeInfo{i}.filter;
    setSize(i) = length(nodeInfo{i}.set);
end
writeDotFile('circle.dot', adja, ecc, setSize);

Bash:

$ neato -Tpng circle.dot -o mapper-output.png
Mapper results

Mapper Results in the Browser


writeJsonFile('circle.json', adja, ecc, setSize);

D3.js in Python Mapper


Miller-Reaven Diabetes Dataset


Diabetes flares of Miller and Reaven

Projection Pursuit

Reproduced Results Reproduced


R code:

require(locfit)
#### Loading required package: locfit
#### locfit 1.5-9.1 	 2013-03-22
data("chemdiab")
normdiab <- rbind(chemdiab)
for (i in 1:5) {
    normdiab[i] <- scale(chemdiab[i],center=FALSE)
}
write.csv(normdiab,'diab-norm.csv',row.names=FALSE)

Diabetes Flares




Diabetes results

Breast Cancer


Microarray Data

Gene Expression

NKI gene expression [van de Vijver]

Agilent Microarray

PAD = DSGA + Mapper


Progression Analysis of Disease is the one-two punch of Disease-Specific Genomic Analysis and Mapper.


DSGA = HSM + FLAT


DSGA is the one-two punch of decomposing disease data with a Healthy State Model created from normal tissue data and the FLAT construction.


Flat = data desparsing + PCA


Flat is the one-two punch of data desparsing and Principal Component Analysis.

NKI Breast Cancer Data


NKI Dataset

Normal Breast Tissue Data


UNC Microarray Database(my source)

Stanford Microarray Database(the paper's source)

Trying PAD


I replaced empty cells with NaN, deleted some poorly formatted rows, and used Matlab to impute missing data and change from log10 scale to log2 scale.


Matlab:

normal_bt = csvread('normal-breast-tissue-data.csv');
normal_bt_imputed = knnimpute(normal_bt, 10);
csvwrite('normal-breast-tissue-imputed.csv', normal_bt_imputed);

nki_data = csvread('nki-chang-complete-data.csv');
nki_data_imputed = knnimpute(nki_data, 10);
nki_data_imputed_log2 = nki_data_imputed .* 3.32192809489;
csvwrite('nki-chang-complete-imputed-log2.csv', nki_data_imputed_log2);

Online PAD Tool


c-MYB+ Cancer?


Ayasdi

By Muthu Alagappan

References

  1. A. Tausz, phom: Persistent Homology in R, Version 1.0.1, 2011. Available at CRAN http://cran.r-project.org.
  2. G. Carlsson. Topology and Data. Bull. Amer. Math. Soc., 46:255-308, 2009.
  3. H. Edelsbrunner, J. L. Harer, Computational Topology: An Introduction, AMS 2010
  4. N. De Jong, A Probabilistic Approach To Simplicial Homology, UTK, 2012
  5. A. J. Zomorodian, Topology for Computing, 2005
  6. ...

Thank you to everybody who came out to support me, especially my committee:

Dr. Fernando Schwartz

Dr. Michael Langston

Dr. Michael Berry