Unsupervised Anomaly Detection on Wisconsin Breast Cancer Data

Hypothesis

It is possible to detect breast cancer in an unsupervised manner.

Setup

We use Isolation Forest [PDF] (via Scikit-Learn) and the L^2-Norm (via Numpy) as lenses through which to view the breast cancer data.

The Isolation Forest assigns an anomaly score to every data point, based on how many random splits of the data it takes to isolate that point. Outliers, on average, need fewer splits to be isolated.
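A minimal sketch of this behaviour on synthetic data (not the breast cancer set): 200 points drawn from a standard normal distribution plus one point placed far away. The names `X_toy`, `forest` and `scores` are illustrative assumptions. Note that Scikit-Learn's `decision_function` returns lower values for more anomalous points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): a dense blob plus one distant point
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])
X_toy = np.vstack([inliers, outlier])

# Fit the forest and score every point; lower score = easier to isolate = more anomalous
forest = IsolationForest(random_state=0).fit(X_toy)
scores = forest.decision_function(X_toy)

outlier_score = scores[-1]
median_inlier_score = np.median(scores[:-1])
print("outlier score:", outlier_score)
print("median inlier score:", median_inlier_score)
```

The distant point should receive a clearly lower score than the typical inlier, which is what makes the score usable as one axis of a lens.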

The L^2-Norm is the Euclidean length of a feature vector: square all features, add them together, then take the square root. Outliers tend to have a different L^2-Norm than the typical data point.
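As a small worked example, the computation can be written out by hand and checked against `np.linalg.norm`. The array `X_toy` is made up for illustration; the third row plays the role of an outlier with unusually large feature values.

```python
import numpy as np

# Three toy rows with real-valued features (hypothetical data)
X_toy = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [30.0, 40.0]])

# Square all features, sum per row, take the square root
manual = np.sqrt(np.sum(X_toy ** 2, axis=1))

# The same row-wise norm via NumPy's built-in
builtin = np.linalg.norm(X_toy, axis=1)

print(manual)   # [ 5.  1. 50.]
```

Because the features are squared before summing, unscaled features with large magnitudes dominate the result.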

Together, the Isolation Forest anomaly scores and the L^2-Norm form a 2-dimensional lens.

The Mapper algorithm [PDF] (via Kepler-Mapper) is then used to create a simplicial complex from the data. Nodes are colored by the average value of the target variable among their members (1 = Malignant, 0 = Benign), i.e. the fraction of malignant cases in each node. The result is visualized as a D3.js graph.
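The node coloring can be sketched as follows. The sketch assumes each node is represented as a list of row indices into the data; the `nodes` dictionary, its keys and `y_toy` are hand-written stand-ins, not output from Kepler-Mapper.

```python
import numpy as np

# Hypothetical target vector: 1 = Malignant, 0 = Benign
y_toy = np.array([0, 0, 1, 1, 1, 0])

# Hypothetical nodes: name -> indices of member points.
# A point may appear in several nodes because the cover intervals overlap.
nodes = {
    "cube0_cluster0": [0, 1, 5],
    "cube0_cluster1": [2, 3, 4],
    "cube1_cluster0": [1, 2],
}

# Color value per node = mean of the target over its members
node_color = {name: float(np.mean(y_toy[members]))
              for name, members in nodes.items()}

print(node_color)
```

A value near 0 marks a mostly benign node, a value near 1 a mostly malignant one, and values in between mark mixed nodes.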

Result

We see an almost perfect separation between the classes, which supports our hypothesis. Note that the target variable is used only to color the nodes, not to build the graph. With rule mining, we can now drill down into the patterns that best describe each cluster. Then we can form new hypotheses and gain insights into the data, without asking specific questions in advance. These automated insights can be used to improve manual detection or supervised classification algorithms.
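One possible way to approach the rule-mining step, offered only as a sketch: fit a shallow decision tree that separates the members of a node from all other points, then print the tree as text rules. This is a stand-in technique, not necessarily the one the result above has in mind. The data, the membership vector `in_node` and the feature names are all made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic features (hypothetical); column names are illustrative only
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
toy_feature_names = ["radius", "texture", "area"]

# Pretend membership in one node is driven by a single feature
in_node = (X_toy[:, 1] > 6.0).astype(int)

# A shallow tree keeps the resulting rules short and readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, in_node)
rules = export_text(tree, feature_names=toy_feature_names)
print(rules)
```

On real data the same idea would use the row indices of a node in the graph to build `in_node`, and the printed thresholds would serve as candidate descriptions of that node to check by hand.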

Code

import pandas as pd
import numpy as np
import km
from sklearn import ensemble

# For data we use the Wisconsin Breast Cancer Dataset
# Via: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
df = pd.read_csv("data.csv")
feature_names = [c for c in df.columns if c not in ["id", "diagnosis"]]
df["diagnosis"] = df["diagnosis"].apply(lambda x: 1 if x == "M" else 0)
X = np.array(df[feature_names].fillna(0)) # quick and dirty imputation
y = np.array(df["diagnosis"])

# We create a custom 1-D lens with Isolation Forest
model = ensemble.IsolationForest(random_state=1729)
model.fit(X)
lens1 = model.decision_function(X).reshape((X.shape[0], 1))

# We create another 1-D lens with L2-norm
mapper = km.KeplerMapper(verbose=3)
lens2 = mapper.fit_transform(X, projection="l2norm")

# Combine both lenses to create a 2-D [Isolation Forest, L^2-Norm] lens
lens = np.c_[lens1, lens2]

# Create the simplicial complex
graph = mapper.map(lens, 
                   X, 
                   nr_cubes=15, 
                   overlap_perc=0.4, 
                   clusterer=km.cluster.KMeans(n_clusters=2, 
                                               random_state=1618033))

# Visualization
mapper.visualize(graph, 
                 path_html="breast-cancer.html", 
                 title="Wisconsin Breast Cancer Dataset", 
                 custom_tooltips=y, 
                 color_function="average_signal_cluster")

Visualization

[Figure: TDA Wisconsin Breast Cancer Dataset]

An interactive live version is also available.

Increasing the overlap (overlap_perc) from 0.4 to 0.7 shows more structure:

[Figure: TDA Wisconsin Breast Cancer Dataset, overlap 0.7]

Acknowledgements

Thanks to BigML for the idea to use Isolation Forest on this dataset.