Hierarchical Clustering on top of Topological Summary

We stack Hierarchical Clustering (by approximating the Minimum Quartet Tree Cost) on top of a topological summary (Mapper with a t-SNE lens)

Topological Summary

We first create a topological summary (compressed representation) from our data set with Topological Data Analysis. For this we use an algorithm that is very close to the Mapper algorithm from Singh et al.

We reduce our 8x8 digit data to 2 dimensions using t-SNE (via scikit-learn). We map this projection with overlapping intervals (30 intervals with 33% overlap). We cluster the points inside these intervals (using DBSCAN).

For every cluster we create a canonical "eigen-digit" by averaging the pixels from all the cluster members. This idea is based on Harlan Sexton's blog "The Beautiful Duality of TDA".

Eigendigit 6
Example of a node with the canonical digit "6".

Hierarchical Clustering

Hierarchical Clustering by approximating the Minimum Quartet Tree Cost (MQTC) uses an algorithm from Cilibrasi et al. This algorithm mutates trees by swapping leafs and subtrees. It searches for trees with a high fitness. The result is an unrooted dendrogram. Every internal node is connected to exactly 3 other nodes or leafs.

We first create a pairwise distance matrix ( Euclidean distance with log-transform) from all the canonical digits. This serves as input for the MQTC algorithm. The algorithm then generates a tree with a high fitness (a high fitness is incorporating as much consistent distance-weighted quartet topologies as possible). The above tree has a fitness of ~0.885.

Preliminary

We can distinguish between two types of "9"'s. One type is close to a "0", because the bottom loop follows the "0"-curve. The other type is closer to a "4", with almost no bottom loop.

We can also spot that the digits "1" and "0" are very dissimilar. From "1" follow narrow "4"'s and narrow "7"'s. "1"'s with a distinguished top dash are closer to "2".

Update: More digits

Repeated the same experiment with 39 canonical/archetypal digits. Tree fitness: 89.57%