Spectral Bridges

sbcluster is a Python package that implements a novel clustering algorithm combining k-means and spectral clustering techniques, called Spectral Bridges. It leverages efficient affinity matrix computation and merges clusters based on a connectivity measure inspired by SVM’s margin concept. This package is designed to provide robust clustering solutions, particularly suited for large datasets.

Features

  • Spectral Bridges Clustering Algorithm: Integrates k-means and spectral clustering with efficient affinity matrix calculation for improved clustering results.

  • Scalability: Designed to handle large datasets by optimizing cluster formation through advanced affinity matrix computations.

  • Customizable: Parameters such as number of clusters, iterations, and random state allow flexibility in clustering configurations.

  • Model selection: Automatic model selection for number of nodes (m) according to a normalized eigengap metric.

  • scikit-learn: Native integration with the standard API, with easy options for model selection and evaluation.

Speed

Spectral Bridges utilizes fastkmeanspp’s efficient implementation for KMeans, which makes it remarkably fast even with large scale datasets.

Installation

You can install the package via pip:

pip install sbcluster

Usage

Example:

from time import time

import matplotlib.pyplot as plt
import numpy as np
from sbcluster import SpectralBridges, ngap_scorer
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import GridSearchCV

# Load some synthetic data
data = np.genfromtxt("datasets/impossible.csv", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# Define the parameter grid
param_grid = {"n_clusters": [2, 3, 4, 5, 6, 7, 8, 9, 10]}
cv = [(np.arange(X.shape[0]), np.arange(X.shape[0]))] * 5

# Perform grid search for optimal parameters
grid_search = GridSearchCV(
   estimator=SpectralBridges(n_clusters=2, n_nodes=250),
   param_grid=param_grid,
   scoring=ngap_scorer,
   cv=cv,
   verbose=1,
)

# Fit the grid search
grid_search.fit(X)

# Print the results
print(grid_search.cv_results_["mean_test_score"])
print(grid_search.best_params_)

# Make predictions with the best model
guess = grid_search.best_estimator_.predict(X)
ari = adjusted_rand_score(y, guess)

# Print the ARI
print(f"Adjusted Rand Index: {ari}")

# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=guess, alpha=0.1)
plt.scatter(
   grid_search.best_estimator_.cluster_centers_[:, 0],
   grid_search.best_estimator_.cluster_centers_[:, 1],
   c=grid_search.best_estimator_.cluster_labels_,
   marker="X",
)
plt.title("Clustered data and centroids with best SpectralBridges fit")
plt.show()

# Compare with sklearn's SpectralClustering
sc_low = SpectralClustering(n_clusters=7).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=sc_low.labels_)
plt.title("Spectral Clustering of the original dataset, gamma=1.0")
plt.show()

sc_high = SpectralClustering(n_clusters=7, gamma=5).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=sc_high.labels_)
plt.title("Spectral Clustering of the original dataset, gmma=5.0")
plt.show()

# Comapre times
start = time()
grid_search.best_estimator_.fit(X)
end = time()
print("SpectralBridges fit time:", end - start)

start = time()
sc_low.fit(X)
end = time()
print("SpectralClustering fit time:", end - start)

API Reference

class SpectralBridges(n_clusters, n_nodes, *, p=2, n_iter=20, n_local_trials=None, random_state=None, tol=1e-08, affinity_transform=<sbcluster._defs.ExpQuantileTransform object>)[source]

Bases: BaseEstimator, ClusterMixin

Spectral Bridges clustering algorithm.

Variables:
  • n_clusters (int) – The number of clusters to form.

  • n_nodes (int) – Number of nodes or initial clusters.

  • p (int | float) – Power of the alpha_i.

  • n_iter (int) – Number of iterations to run the k-means algorithm.

  • n_local_trials (int | None) – Number of seeding trials for centroids initialization.

  • random_state (int | None) – Determines random number generation for centroid initialization.

  • tol (float) – Tolerance for the normalized eigengap.

  • affinity_transform (AffinityTransform) – Affinity transform to apply to the affinity matrix.

  • cluster_centers (np.ndarray | None) – Coordinates of cluster centers.

  • cluster_labels (np.ndarray | None) – Labels of each cluster.

  • labels (np.ndarray | None) – Labels of each data point.

  • affinity_matrix (np.ndarray | None) – Affinity matrix.

  • ngap (float | None) – The normalized eigengap.

Parameters:
  • n_clusters (int)

  • n_nodes (int)

  • p (int | float)

  • n_iter (int)

  • n_local_trials (int | None)

  • random_state (int | None)

  • tol (float)

  • affinity_transform (AffinityTransform)

cluster_labels_: ndarray | None
labels_: ndarray | None
n_clusters: int
n_nodes: int
p: int | float
n_iter: int
n_local_trials: int | None
random_state: int | None
tol: float
affinity_transform: AffinityTransform
cluster_centers_: ndarray | None
ngap_: float | None
affinity_matrix_: ndarray | None
fit(X, y=None)[source]

Fit the Spectral Bridges model on the input data X.

Parameters:
  • X (npt.ArrayLike) – Input data to cluster.

  • y (None, optional) – Placeholder for y.

Raises:
  • ValueError – If the number of samples is less than the number of clusters.

  • ValueError – If X contains inf or NaN values.

Returns:

The fitted model.

Return type:

Self

predict(X)[source]

Predict the nearest cluster index for each input data point x.

Parameters:

X (npt.ArrayLike) – The input data.

Raises:
  • ValueError – If X contains inf or NaN values.

  • ValueError – If self.cluster_centers_ and self.cluster_labels_ are not set.

Returns:

np.ndarray The predicted cluster indices.

Return type:

ndarray