Sequential K-Means

Sequential K-means clustering is a modification to the traditional K-means method where the algorithm acts upon subsets of supplied data sequentially rather than in a single batch. This is important in two capacities:

It allows clusters and their associated centroids to be discerned without loading an entire dataset into memory.
It allows streaming/online updates to a model to be made thus providing the ability to react to changes in underlying supplied data

Both traditional and sequential K-means are initialized in the same manner - utilizing K-means++ or random assignment.

Within sequential K-means, a sample point x_t is used to update the closest cluster center from a value c_t-1 to c_t using the formula:

\[c_t=c_{t-1}+a(x_t-c_{t-1})\]

Where

Parameter	Description
\(a\)	Learning rate (0<a<1) defining how quickly a model changes based on new information.

The a value above controls the rate at which the concept of forgetfulness within the algorithm is applied. Forgetful K-means describes a method by which K-means clustering can be modified to account for changes to the position of clusters over time, unlike traditional K-means the introduction of a forgetful component allows the model to evolve and react to new information. This is a balance between prioritising recent information vs the past.

The forgetful option provided with this implementation controls how this algorithm operates, if set to true 1b then the learning rate a is used in the above formula as a floating point value, if however forgetful is set to 0b then the value a in the above formula is set to 1/(n+1) where n is the number of points in cluster c_t-1.

By default in the below implementation the algorithm operates as a forgetful implementation with a learning rate of 0.1

.ml.online.clust.sequentialKMeans.fit

Fit a Sequential K-means model

.ml.online.clust.sequentialKMeans.fit[X;df;k;centers;config]

Parameters:

name	type	description
`X`	`any`	Input/training data of N dimensions.
`df`	`symbol`	Distance function used in clustering. This must be one of `edist` or `e2dist` denoting the euclidean distance and euclidean squared distance respectively.
`k`	`long`	The number of clusters.
`centers`	`dictionary\|null`	Initial cluster centers. If null, initial centers are calculated using k++/random initialisation. If dictionary, must contain `num` and `centroids` which define the number of points in a cluster and the cluster location often calculated from a previous 'fit' phase.
`config`	`dictionary`	Any modifications to be applied during the fitting process of Sequential K-means (See here for more details).

In the above function, the following are the optional entries for config:

name	type	description	default
`init`	boolean	Cluster centroid initialization - K-means++/randomization (`1b/0b`)	`1b`
`a`	float	Learning rate	`0.1`
`forgetful`	boolean	Forgetful/normal sequential K-Means (`1b/0b`)	`1b`

Returns:

type	description
`dictionary`	All information collected during the fitting of a model, along with prediction and update functionality.

Note that the information collected during the fitting of the model is contained within the modelInfo key and includes:

name	type	description
`num`	`long[]`	Number of data points in each cluster.
`centroids`	`float[]`	Cluster centers associated with each cluster.
`inputs`	dictionary	Configuration inputs used when fitting the model.

Prediction functionality is contained within the predict key. The function takes X, the feature data which is to be predicted, as input and returns the predicted values. While the update function is stored within the update key and again takes X as input and returns an updated representation of the original model.

Examples:

Example 1: Fit a vanilla sequential K-means algorithm

// Generate a datasetq)data:([]1000?10f;1000?10f)// Select random sample of points to pass to K-meansq)sample:neg[500]?data// Set parametersq)df:`e2dist // Distance function - Euclidean squaredq)k:3 // Number of clustersq)centers:(::) // Cluster informationq)config :(::) // Additional parameters// Fit K-means based on sampleq)show model:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]modelInfo| `num`centroids`inputs!(170 180 150;(5.59595 2.110237 6.841375;8.10..predict | {[returnInfo;data] modelInfo:returnInfo`modelInfo; data:clust...update | {[returnInfo;data] modelInfo:returnInfo`modelInfo; inputs:mode..// Calculated centroidsq)model[`modelInfo]`centroids5.59595 2.110237 6.8413758.104928 4.878808 3.219346// Use the fitted model to predict on new dataq)newData:2 50#100?10.q)model.predict newData2 1 1 1 1 1 0 1 0 1 1 0 0 0 2 0 0 0 0 2 1 1 1 2 1 0 0 1 0 1 1 0 0 2 1 1 0 2 1..// Update the fitted model with new dataq)show updModel:model.update newDatamodelInfo| `num`centroids`inputs!(183 199 168;(6.092505 1.509578 6.613747;8.5..predict | {[returnInfo;data] modelInfo:returnInfo`modelInfo; data:clust...update | {[returnInfo;data] modelInfo:returnInfo`modelInfo; inputs:mode..// Updated centroidsq)updModel[`modelInfo]`centroids6.092505 1.509578 6.6137478.509002 5.429571 2.729484

Example 2: Fit a sequential K-means algorithm providing initial cluster centers and using the non-forgetful version of the algorithm

// Generate a datasetq)data:([]1000?10f;1000?10f)// Select random sample of points to pass to K-meansq)sample:neg[500]?data// Set parametersq)df:`e2dist // Distance function - Euclidean squaredq)k:3 // Number of clustersq)centers:neg[1]_model[`modelInfo] // Cluster centers from previous fittingq)config :enlist[`forgetful]!enlist 0b // Additional parameters// Fit K-means based on sampleq)show model1:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]modelInfo| `num`centroids`inputs!(337 353 310;(5.874493 2.054478 6.938887;8.1..predict | {[returnInfo;data] modelInfo:returnInfo`modelInfo; data:clust...update | {[returnInfo;data] modelInfo:returnInfo`modelInfo; inputs:mode..// Calculated centroidsq)model1[`modelInfo]`centroids5.874493 2.054478 6.9388878.124769 4.845081 3.034611// Use the fitted model to predict on new dataq)newData:2 50#100?10.q)model1.predict newData0 2 2 2 0 2 1 0 1 0 1 1 0 1 2 1 2 0 1 0 2 1 2 1 1 1 1 0 1 1 1 2 1 2 1 1 1 2 2..// Update the fitted model with new dataq)show updModel1:model1.update newDatamodelInfo| `num`centroids`inputs!(350 376 324;(5.896282 2.03924 6.971541;8.14..predict | {[returnInfo;data] modelInfo:returnInfo`modelInfo; data:clust...update | {[returnInfo;data] modelInfo:returnInfo`modelInfo; inputs:mode..