Clustering

Quickly cluster your data. By default, data is clustered into 25 clusters using scikit-learn's KMeans.

from relevanceai import Client

client = Client(token=YOUR_ACTIVATION_TOKEN)
ds = client.Dataset("quickstart")

vector_fields = ["word_vector_"]
cluster_ops = ds.cluster(vector_fields=vector_fields)
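Under the hood, this default is roughly equivalent to fitting scikit-learn's KMeans with 25 clusters over the chosen vector field. A minimal sketch on synthetic stand-in vectors (not Relevance's internal code):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the "word_vector_" field: 200 documents, 8-dim vectors.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(200, 8))

# Mirrors the documented default: 25 clusters via KMeans.
model = KMeans(n_clusters=25, n_init=10, random_state=0)
labels = model.fit_predict(vectors)

# Relevance-style string labels: "cluster_0", "cluster_1", ...
cluster_labels = [f"cluster_{label}" for label in labels]
```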

Use any Scikit-learn model

Using your own scikit-learn model lets you freely adjust all the clustering parameters. For non-centroid-based methods, a grand centroid is calculated by averaging all the vectors in each cluster.

from sklearn.cluster import AgglomerativeClustering

cluster_model = AgglomerativeClustering()

vector_fields = ["word_vector_"]
cluster_ops = ds.cluster(vector_fields=vector_fields, model=cluster_model, alias="agglomerative")
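For a non-centroid method like AgglomerativeClustering, the grand centroid described above can be sketched as a per-cluster mean of the member vectors. This is an illustrative sketch with synthetic data, not Relevance's internal implementation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in vectors: 50 documents, 4-dim.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 4))

labels = AgglomerativeClustering(n_clusters=3).fit_predict(vectors)

# Grand centroid of each cluster: the average of its member vectors.
centroids = {
    f"cluster_{c}": vectors[labels == c].mean(axis=0)
    for c in np.unique(labels)
}
```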

Setting an alias helps you distinguish between the different clusterings you've run. The naming convention of the field is
_cluster_.{vector_field}.{alias}, and it creates cluster labels like this:

{"_cluster_" : { "word_vector": { "agglomerative": "cluster_1" } } }
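Given that structure, a document's cluster label can be read back by walking the nested field. A small sketch (the document dict here is illustrative):

```python
# Illustrative document carrying a cluster label under the naming convention.
doc = {"_cluster_": {"word_vector": {"agglomerative": "cluster_1"}}}

vector_field = "word_vector"
alias = "agglomerative"

# Walk _cluster_ -> vector field -> alias to get the label.
label = doc["_cluster_"][vector_field][alias]
# label == "cluster_1"
```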

Use with HDBSCAN

HDBSCAN is an alternative to DBSCAN that extracts a flat clustering based on cluster stability. You can read more about it here: https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html

from hdbscan import HDBSCAN

cluster_model = HDBSCAN()

vector_fields = ["word_vector_"]
cluster_ops = ds.cluster(vector_fields=vector_fields, model=cluster_model, alias="hdbscan")

📘

Outlier clusters

Methods like HDBSCAN will produce outlier clusters. They are labelled as "cluster_-1" in Relevance.
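HDBSCAN follows the common scikit-learn convention of labelling noise points -1, which is what becomes "cluster_-1". A sketch of that mapping using scikit-learn's DBSCAN (which uses the same -1 noise convention) on synthetic data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs plus one far-away outlier point.
vectors = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
     [100.0, 100.0]]  # outlier
)

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(vectors)

# Noise points get label -1, which maps to "cluster_-1".
cluster_labels = [f"cluster_{label}" for label in labels]
```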

Retrieve cluster centroids

cluster_ops.centroids