Machine Learning

q Python

.qsp.ml.adaBoostClassifier[X;y;prediction;.qsp.use (!) . flip (
 (`nEstimators ; nEstimators);
 (`learningRate; learningRate);
 (`algorithm ; algorithm);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the features values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function of the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`nEstimators`	`int`	Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If we already have a perfect fit, we will not create this maximum number. Minimum value `1`.	`50`
`learningRate`	`float`	Controls the loss function used to set the weight of each classifier at each boosting iteration. The higher this value, the more each classifier will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is `0.0`.	`1.0`
`algorithm`	`string`	Multi-class AdaBoost function used to extend the AdaBoost operator to have multi-class capabilities. This value can be `SAMME` for stagewise additive modeling or `SAMME.R` for real-valued stagewise additive modeling.	`SAMME.R`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with an adaBoost classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the adaBoostClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.adaBoostClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the adaBoostClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.adaBoostClassifier[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.adaBoostRegressor[X;y;prediction]
.qsp.ml.adaBoostRegressor[X;y;prediction;.qsp.use (!) . flip (
 (`nEstimators ; nEstimators);
 (`learningRate; learningRate);
 (`loss ; loss);
 (`modelInit ; modelInit);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`nEstimators`	`int`	Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is `1`.	`50`
`learningRate`	`float`	Weight applied to each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. This value must lie in the range `(0.0, inf)`.	`1.0`
`loss`	`string`	Loss function used to update the contributing weights of the regressors after each boosting iteration. This can be `linear` for linear loss, `square` for mean squared error, or `exponential` for exponential loss.	`"linear"`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit an adaBoost regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the adaBoostRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.adaBoostRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"AdaModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
 | x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the adaBoostRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.adaBoostRegressor[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.affinityPropagation[X;cluster;.qsp.use (!) . flip (
 (`damping ; damping);
 (`maxIter ; maxIter);
 (`convergenceIter; convergenceIter);
 (`affinity ; affinity);
 (`randomState ; randomState);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns by using all non-categorical columns.	Required
`cluster`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `cluster` is used.	Required

options:

name	type	description	default
`damping`	`float`	Provides numerical stabilization and limits oscillations and “overshooting” of parameters by controlling the extent to which the current value is maintained relative to incoming values. This value must lie in the range `[0.5, 1.0)`.	`0.5`
`maxIter`	`int`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`200`
`convergenceIter`	`int`	Number of iterations, during which there is no change in the number of estimated clusters, needed to stop the convergence. Minimum value is `1`.	`15`
`affinity`	`string`	Statistical measure used to define similarities between the representative points. This value can be `euclidean` to use negative squared Euclidean distance or `precomputed` to use the values in the data's distance matrix.	`euclidean`
`randomState`	`int`	Integer value used to control the state of the random generator used in this model. Specifying this allows for reproducible results across function calls. If set to `::`, the randomness is based off the current timestamp.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or cluster model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the cluster parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`clusters`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a affinityPropagation clustering model storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the affinityPropagation operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.affinityPropagation[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"AffinityPropagationModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
 | x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
 update newClust: clusters from data
 };
// Define and run a stream processor pipeline using the affinityPropagation operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.affinityPropagation[xFunc;clustFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.birch[X;cluster;.qsp.use (!) . flip (
 (`threshold ; threshold);
 (`branchingFactor; branchingFactor);
 (`nClusters ; nClusters);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; config))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns by using all non-categorical columns.	Required
`cluster`	`symbol, ::, or function`	Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `cluster` is used.	Required

options:

name	type	description	default
`threshold`	`float`	Maximum cluster radius allowed for a new sample to be merged into its closest subcluster. If adding this point to the cluster would cause that clusters radius to exceed this maximum, the new point is not added and instead becomes a new subcluster. Minimum value is `0.0`.	`0.5`
`branchingFactor`	`int`	Maximum number of subclusters in each node in the tree, where each leaf node contains a subcluster. If a new sample arrives causing the number of subclusters to exceed this value for a given node, the node is split into two nodes. Minimum value is `1`.	`50`
`nClusters`	`int`	Final number of clusters to be defined by the model. Minimum value is `2`.	`3`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or cluster model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the cluster parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`clusters`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a Birch clustering model storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the birch operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.birch[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"BirchModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
 | x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
 update newClust: clusters from data
 };
// Define and run a stream processor pipeline using the birch operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.birch[xFunc;clustFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.cure[X;cluster;.qsp.use (!) . flip (
 (`df ; df);
 (`n ; n);
 (`c ; c);
 (`cutDict ; cutDict);
 (`bufferSize; bufferSize);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns by using all non-categorical columns.	Required
`cluster`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `cluster` is used.	Required

options:

name	type	description	default
`df`	`symbol`	Distance function used to measure the distance between points when clustering. This can be `edist` for Euclidean distance, `e2dist` for squared Euclidean distance, `nege2dist` for negative squared Euclidean distance, `mdist` for Manhattan distance, or `cshev` for Chebyshev distance.	`edist`
`n`	`int`	Number of representative points to choose from each cluster to compare the similarity of clusters for the purposes of potentially merging them. Minimum value is `1`.	`2`
`c`	`float`	Compression factor used for grouping the representative points together. Minimum value is `0.0`.	`0.0`
`k`	`int`	Final number of clusters to be defined by the model. Minimum value is `2`. The distance used when cutting the dendrogram will be adjusted to fit this number so only specify one of the parameters `k` or `dist`. If set to `::`, the `dist` parameter will be used. If both are set to `::`, the `cutDict` parameter will be used.	`::`
`dist`	`float`	Distance between leaves at which to cut the dendrogram to define the clusters. Minimum value is `0.0`. The number of clusters will be dynamic based on this distance so only specify one of the parameters `k` or `dist`. If set to `::`, the `k` parameter will be used. If both are set to `::`, the `cutDict` parameter will be used.	`::`
`cutDict`	`dict`	A dictionary that defines the cutting algorithm used when splitting the data into clusters. This can be used to define a `k` value or a `dist` value (documentation for these above).	enlist[`k]!enlist 3
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or cluster model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the cluster parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`clusters`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a cure clustering model storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the cure operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.cure[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"cureModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
 | x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
 update newClust: clusters from data
 };
// Define and run a stream processor pipeline using the cure operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.cure[xFunc;clustFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.dbscan[X;cluster;.qsp.use (!) . flip (
 (`df ; df);
 (`minPts ; minPts);
 (`eps ; eps);
 (`bufferSize; bufferSize);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; config))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns by using all non-categorical columns.	Required
`cluster`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `cluster` is used.	Required

options:

name	type	description	default
`df`	`symbol`	Distance function used to measure the distance between points when clustering. This can be `edist` for Euclidean distance, `e2dist` for squared Euclidean distance, `nege2dist` for negative squared Euclidean distance, `mdist` for Manhattan distance, or `cshev` for Chebyshev distance.	`edist`
`minPts`	`int`	Minimum number of points required to be close together before this group of points is defined as a cluster. The maximum distance these points are to be away from one another must be less than or equal to the `Maximum Distance Between Points` parameter. Minimum value is `1`.	`2`
`eps`	`float`	Maximum distance points are allowed to be away from one another to still be classed as close enough to be in the same cluster. Minimum value is `0.0`.	`1.0`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or cluster model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the cluster parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`clusters`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a dbscan clustering model storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the dbscan operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.dbscan[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"dbscanModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
 | x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
 update newClust: clusters from data
 };
// Define and run a stream processor pipeline using the dbscan operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.dbscan[xFunc;clustFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.decisionTreeClassifier[X;y;prediction;.qsp.use (!) . flip (
 (`criterion ; criterion);
 (`splitter ; splitter);
 (`maxDepth ; maxDepth);
 (`minSamplesSplit; minSamplesSplit);
 (`minSamplesLeaf ; minSamplesLeaf);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`criterion`	`string`	Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be `gini`, to use the Gini impurity measure, or `entropy`, to use the information gain measure.	`gini`
`splitter`	`string`	Strategy used to split the nodes in the tree. This can be `best` to choose the best split or `random` to choose the best random split.	`best`
`maxDepth`	`int`	Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain less than the `Minimum Samples To Split Node` value.	`::`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a decision tree classifier model on data and store the model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the decisionTreeClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.decisionTreeClassifier[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTCModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry.
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the decisionTreeClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.decisionTreeClassifier[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.decisionTreeRegressor[X;y;prediction]
.qsp.ml.decisionTreeRegressor[X;y;prediction;.qsp.use (!) . flip (
 (`criterion ; criterion);
 (`splitter ; splitter);
 (`maxDepth ; maxDepth);
 (`minSamplesSplit; minSamplesSplit);
 (`minSamplesLeaf ; minSamplesLeaf);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`criterion`	`string`	Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be `squared_error` for mean squared error, `friedman_mse` for mean squared error with Friedman's improvement score, `absolute_error` for mean absolute error, or `poisson` for Poisson deviance.	`"squared_error"`
`splitter`	`string`	Strategy used to split the nodes in the tree. This can be `best` to choose the best split or `random` to choose the best random split.	`"best"`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`maxDepth`	`int`	Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain less than the `Minimum Samples To Split Node` value. Minimum value is `1`.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a decision tree regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the decisionTreeRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.decisionTreeRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
 | x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the decisionTreeRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.decisionTreeRegressor[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.dropConstant[X]
.qsp.ml.dropConstant[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], dictionary, or ::`	Name of the column(s) in the input table to remove because they contain a constant value throughout. Can also be a dictionary mapping column names to their associated constant values whereby only columns with these names and values will be dropped. If set to `::`, the operator will be applied to all columns in the data that contain a constant value throughout.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before dropping the constant columns from the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns the input data with the constant valued columns no longer in the table.

The columns to be removed from the data are either specified by the user beforehand, through a list or dictionary, or these columns are determined using the .ml.dropConstant function. This function checks the data for columns that contain a constant value throughout. If a non-constant column is supplied, an error is thrown.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Drop the constant columns protocol and response.

// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.dropConstant[`protocol`response]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to drop its constant values
publish ([] protocol: `TCP; response: 200i; latency: 10?5f; size: 10?10000);

Example 2: Drop the columns id and ratio, checking that their values match the expected constant values.

// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.dropConstant[`id`ratio!(1; 2f)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to drop its constant values
publish ([] id: 1; ratio: 2f; data: 10?10f);

Example 3: Drop columns whose value is constant for all buffered records.

// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.dropConstant[::;.qsp.use enlist[`bufferSize]!enlist 100]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to drop its constant values
publish ([] motorID: 0; rpms: 1000 + 200?10; temp: 60 + 200?5)

sp.ml.drop_constant('lastPrice')

Parameters:

name	type	description	default
`columns`	`symbol, symbol[], dictionary, or ::`	Columns to drop. Either a column name as string to drop a single column, or a list of strings to drop several columns, or a dictionary with column names and expected constant values as the key-value pairs to drop several columns, or `None` to drop all constant columns. Note: The columns to be dropped will be determined by the first batch and stored for application to subsequent batches.	Required

options:

name	type	description	default
`buffer_size`	`long`	The number of data points which must amass before columns are dropped.	`0`

Returns:

A pipeline comprised of a 'drop_constant' operator, which can be joined to other pipelines.

Examples:

Drop a constant column from a batch the data:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.drop_constant('x')
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': np.ones(10),
'a': np.random.randn(10),
'b': np.random.randn(10)
})
>>> kx.q('publish' , data)

a          b
---------------------
1.190654   0.3066241
-0.4255034 1.713646
-1.185454  -0.6019988
-2.161811  0.7125692
-1.934769  0.2542738
2.057811   -0.4257737
0.4890321  0.9682337
-0.1086504 1.506568
-1.311511  -1.68155
0.07270301 0.8865468

q Python

.qsp.ml.featureHasher[X;n]

Parameters:

name	type	description	default
`X`	`symbol or symbol[]`	Name of the column(s) in the data to perform the feature hashing on.	Required
`n`	`long`	Number of resulting numeric columns to represent each specified column.	Required

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns the input table with each specified column now replaced by `n` new numeric columns which contain the hashed feature values.

This operator is used to encode categorical variables numerically. It is similar to one-hot encoding, but does not require the categories or number of categories to be known in advance.

It converts each chosen column into n columns, sending each string/symbol to its truncated hash value. The hash function employed is the signed 32-bit version of Murmurhash3.

As the mapping between values and their hashed representations is effectively random, collisions are possible, and the hash space must be made large enough to reduce collisions to an acceptable level.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Encode a single categorical column.

// Define and run a stream processor pipeline using the featureHasher operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.featureHasher[`location; 10]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its features through hashing
publish ([] location: 20?`london`paris`berlin`miami; num: til 20);

Example 2: Encode multiple categorical columns.

// Define and run a stream processor pipeline using the featureHasher operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.featureHasher[`srcIP`destIP; 14]
 .qsp.write.toVariable[`output];
// Pass a batch of data to the stream processor to encode its features through hashing
IPs: "." sv/: string 4 cut 100?256;
publish ([] srcIP: 100?IPs; destIP: 100?IPs; latency: 100?10; size: 100?10000);

Coming Soon

q Python

.qsp.ml.registry.fit[X;y;untrained;modelType;prediction]
.qsp.ml.registry.fit[X;y;untrained;modelType;prediction; .qsp.use (!) . flip (
 (`bufferSize; bufferSize);
 (`modelArgs ; modelArgs);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; config))]

Parameters:

name	type	description	default
`X`	`symbol[], ::, or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the features are inferred from the y value.	Required
`y`	`symbol, function, or ::`	Can be the name of the column containing the data’s target labels OR a user-defined function that returns the target values to use. This must be `::` when training an unsupervised model.	Required
`untrained`	`function`	An untrained q/sklearn model that we want to fit.	Required
`modelType`	`string`	Whether the model we are fitting is a `"q"` model or an `"sklearn"` model.	Required
`prediction`	`symbol, ::, or function`	Can be the name of the column which is to house the model’s predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the symbol `prediction` is used.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelArgs`	`list`	List of arguments passed to the model to help configure the fitting process.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under 'unnamedExperiments'.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

For all common arguments, refer to configuring operators

Returns:

type	description
`any`	The current batch, modified in accordance with the `prediction` parameter.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter. In unsupervised models, this value is set to `::`.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

Fits a model to a batch or buffer of data, saving the model to the registry, and predicting the target variable for future batches after the model has been trained.

N.B. This is only for models that cannot be trained incrementally. For other models, .qsp.ml.registry.update should be used.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a q model.

// Generate initial data to be used for fitting
a:500?1f;
b:500?1f;
data:([]x:a;x1:b;y:a+b);
// Define optional model fitting parameters
optKeys:`model`registry`experiment`modelArgs;
optVals:("sgdLR";::;::;(1b; `maxIter`gTol`seed!(100;-0w;42)));
opt:optKeys!optVals;
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.fit[
 `x`x1;
 `y;
 .ml.online.sgd.linearRegression;
 "q";
 `yhat;
 .qsp.use opt
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore[::;::]

Example 2: Fit an sklearn model.

// Generate initial data to be used for fitting
data:([]x:asc 100?1f;x1:100?1f;y:desc 100?5);
// Define and fit an sklearn model
rfc:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`max_depth pykw 2];
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.fit[
 `x`x1;
 `y;
 rfc;
 "sklearn";
 `yhat
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Example 3: Fit an unsupervised q model.

// Generate initial data to be used for fitting
data:([]x:1000?1f;x1:1000?1f;x2:1000?1f);
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.fit[
 `x`x1`x2;
 ::;
 .ml.clust.kmeans;
 "q";
 `cluster;
 .qsp.use enlist[`modelArgs]!enlist(`e2dist;3;::)
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Example 4: Fit a model while passing functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.fit[
 xFunc;
 yFunc;
 .ml.online.sgd.linearRegression;
 "q";
 predFunc;
 .qsp.use enlist[`modelArgs]!enlist(1b; `maxIter`gTol`seed!(100;-0w;42))
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data

sp.ml.fit(
kx.q('{delete y from x}'),
'y',
kx.q('.ml.online.sgd.logClassifier'),
sp.ml.ModelType.q,
'yhat',
registry='/tmp',
model='qLC',
model_args=[True, {'seed': 42, 'k': 5}]
)

Parameters:

name	type	description	default
`features`	`symbol[], ::, or function`	Column name of the predictor variable, or the function required to generate predictors from a batch.	Required
`targets`	`symbol, function, or ::`	Column name of the target variable, or the function required to generate predictors from a batch, or `None` if an unsupervised model is to be trained.	Required
`untrained`	`function`	Untrained model.	Required
`model_type`	`string`	A `ModelType` enum value, or the string equivalent.	Required
`prediction`	`symbol, ::, or function`	Function used to score the quality of the model or to join predictions into the batch.	Required
`udf`		Deprecated, replaced by prediction.

options:

name	type	description	default
`buffer_size`	`long`	Number of records to buffer before training the model, or False to fit on the first batch. When the batch size is exceeded, any additional records will also be included when training.	`0`
`model_args`	`list`	List of arguments to pass to the model after X and y. If there is only a single argument, it must be enlisted.	`::`
`model`	`string`	Name of the current model to be stored within the registry. If 'model' is omitted or `None` then the current model will not be saved.	`::`
`registry`	`string`	Local/cloud registry to save the model to.	`::`
`experiment`	`string`	Name of the experiment within the registry to save the current model to.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry. See Config Values table below.	`()!()`

Config Values:

name	type	description	default
`data`		If provided with 'data', the function will attempt to parse out relevant statistical information associated with the data for use within model deployment.
`requirements`		Option to add Python requirements information associated with a model. Either a boolean value `True` indicating the use of `pip freeze`, or a symbol indicating the path to a `requirements.txt` file, or a list of strings defining the requirements to be added.
`major`		Boolean value indicating if model version is to be incremented as a 'major' version i.e. should the model be incremented from `1 0` to `2 0` (`True`) rather than `1 0` to `1 1` as is default (`False`).
`major_version`		The major version to be incremented. By default the function will increment major versions based on the current maximum model version present within the registry. However, 'major_version' allows users to define which major version of a model they wish to be incremented.
`code`		Reference to the location of any `.py`, `.p`, `.k` or `.q` files. These files are then automatically loaded on retrieval of the models using the `.get.` functionality.
`axis`		Boolean value indicating the required data format for model training, i.e. should the data have `value flip` (`True`) or `flip value flip` (`False`) applied before training.
`supervise`		List of metrics to be used for supervised monitoring of the model.

Returns:

A fit operator, which can be joined to other operators or pipelines.

Examples:

Fit a model on a batch of data and view model in model store:

>>> from kxi import sp
>>> import pykx as kx
>>> import pytest
>>> from sklearn.cluster import KMeans
>>> from sklearn.linear_model import LinearRegression
>>> import pandas as pd
>>> import random
>>> def sp_pipeline(operator):
sp.run(sp.read.from_callback('publish')
| operator
| sp.write.to_variable('.test.cache', mode='overwrite'))
>>> bool_data = pd.DataFrame({
'x': [random.uniform(0, 1) for _ in range(100)],
'x1': [random.uniform(0, 1) for _ in range(100)],
'x2': sorted([random.uniform(0, 1) for _ in range(100)]),
'y': [bool(i) for i in range(1, 101)],
})
>>> sp_pipeline(
sp.ml.fit(kx.q('{delete y from x}'),
'y',
kx.q('.ml.online.sgd.logClassifier'),
sp.ml.ModelType.q,
'yhat',
registry='/tmp',
model='qLC',
model_args=[True, {'seed': 42, 'k': 5}]))
>>> kx.q('publish', bool_data)
>>> assert kx.q('100~count .test.cache')
>>> assert kx.q('{cols[.test.cache]~cols[x],`yhat}', bool_data)
>>> assert kx.q('01b~asc distinct .test.cache`yhat')
>>> kx.q('.test.cache')

x           x1         x2          y yhat
-----------------------------------------
0.438085    0.1456259  0.009515322 0 0
0.9236949   0.5339187  0.01069584  0 0
0.1721638   0.01717696 0.02064306  0 0
...
0.7064919   0.9793561  0.5711501   1 0
0.4621936   0.9414561  0.5720107   1 0
0.05334808  0.09241486 0.5788343   1 0
0.3619266   0.7338493  0.5957937   1 0
0.9695426   0.4179357  0.601046    1 0
0.6921203   0.6083793  0.6513348   1 0
0.09326991  0.9893873  0.6885329   1 0
0.1829138   0.7280614  0.7131808   1 0
0.945359    0.1291827  0.7154933   1 0
0.05329599  0.09970421 0.7177233   1 1
0.3780809   0.9164885  0.7378628   1 0
0.7895978   0.1609393  0.7734308   1 0
...

q Python

.qsp.ml.freshCreate[X;features]
.qsp.ml.freshCreate[X;features;.qsp.use enlist[`warn]!enlist warn]

Parameters:

name	type	description	default
`X`	`symbol or symbol[]`	Name of the column(s) in the data to use for FRESH feature generation.	Required
`features`	`symbol, symbol[], or ::`	Name of the FRESH feature(s) we want to define from the data. A full list of these features can be found here.	Required

options:

name	type	description	default
`warn`	`boolean`	Show warnings 1b / Suppress warnings 0b.	`0b`

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns a table containing the specified aggregated FRESH feature columns for each selected column in the input table.

Converts each chosen column into a collection of feature values based on the supplied FRESH features. Typically, the operator is intended to be used in conjunction with the windowing operators that provide regular batches of data from which we engineer features. The aggregate statistics used to create these features can be as simple as max/min/count.

For the feature parameter, if it is set to: :: - all features are applied. noHyperparameters - all features except hyperparameters are applied. noPython - all features that don't rely on Python are applied.

As this aggregates a batch to a single row of aggregated statistics, the output table does not include the original columns.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Build two features - absEnergy and max.

// Define and run a stream processor pipeline using the freshCreate operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.window.tumbling[00:01:00; `time]
 .qsp.ml.freshCreate[`x; `absEnergy`max]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to create the fresh features
publish ([] time: .z.p+00:00:01 * til 500; x: 500?1f);

Example 2: Build all features.

// Define and run a stream processor pipeline using the freshCreate operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.window.count[100]
 .qsp.ml.freshCreate[`x; ::]
 .qsp.write.toVariable[`output];
// Pass a batch of data to the stream processor to create the fresh features
publish ([] x: 500?1f; y: 500?100);

sp.ml.fresh_create('x', ['min', 'max', 'absEnergy'])

Parameters:

name	type	description	default
`columns`	`symbol or symbol[]`	Columns on which FRESH feature creation is to be applied, or `None` to apply to all columns within a batch.	Required
`features`	`symbol, symbol[], or ::`	Functions to be applied to each column in a batch as either a string naming the function to be applied, or list of strings naming the functions to be applied, or `None` to apply all applicable functions defined within 'kx.q.ml.fresh.params'. Note: A number of bespoke string values are supported including: 1. 'regression': Apply any functions appropriate for application on floating point data. 2. 'classification': Apply any functions appropriate for application on columns which contain 'classes'. 3. 'noHyperparameters': Apply any underlying functions which do not use configurable parameters.	Required

Returns:

A fresh_create operator, which can be joined to other operators or pipelines.

Examples:

Setting up a stream to compute three features, with a window to batch the data:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.window.timer(np.timedelta64(10, 's'), count_trigger = 1)
| sp.ml.fresh_create('x', ['min', 'max', 'absEnergy'])
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({'x': np.random.randn(10)})
>>> kx.q('publish', data)

x_absEnergy x_max    x_min
------------------------------
14.30635    0.917339 -2.102371

q Python

.qsp.ml.gaussianNB[X;y;prediction;.qsp.use (!) . flip (
 (`priors ; priors);
 (`varSmoothing; varSmoothing);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`priors`	`float[]`	List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is `0.0`. If set to `::`, the priors will be adjusted according to the data.	`::`
`varSmoothing`	`float`	Value added to the Gaussian distributions variance to widen the curve and account for more samples further away from the distributions mean. Minimum value is `0`.	`1e-9`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a gaussian naive bayes model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the gaussianNB operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.gaussianNB[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the gaussianNB operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.gaussianNB[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.gradientBoostingRegressor[X;y;prediction]
.qsp.ml.gradientBoostingRegressor[X;y;prediction;.qsp.use (!) . flip (
 (`loss ; loss);
 (`learningRate ; learningRate);
 (`nEstimators ; nEstimators);
 (`minSamplesSplit; minSamplesSplit);
 (`minSamplesLeaf ; minSamplesLeaf);
 (`maxDepth ; maxDepth);
 (`modelInit ; modelInit);
 (`bufferSize ; bufferSize);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`loss`	`string`	Loss function that is optimized using gradient descent to get the best model fit. Can be `squared_error`, `absolute_error`, `huber` which is a combination of `squared_error` and `absolute_error`, or `quantile` which allows for quantile regression (using conditional median).	`"squared_error"`
`learningRate`	`float`	Controls the loss function used to set the weight of each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is `0.0`.	`0.1`
`nEstimators`	`int`	Maximum number of tree estimators to train. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is `1`.	`100`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`maxDepth`	`int`	Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain less than the `Minimum Samples To Split Node` value. Minimum value is `1`.	`3`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a gradient boosting regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the gradientBoostingRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.gradientBoostingRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"GbModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
 | x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the gradientBoostingRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.gradientBoostingRegressor[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.kNeighborsClassifier[X;y;prediction;.qsp.use (!) . flip (
 (`nNeighbors; nNeighbors);
 (`weights ; weights);
 (`algorithm ; algorithm);
 (`leafSize ; leafSize);
 (`p ; p);
 (`metric ; metric);
 (`bufferSize; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`nNeighbors`	`int`	Number of already classified points, which lie closest to a given unclassified point (neighbors), to factor in when predicting the points class. Minimum value is `1`.	`5`
`weights`	`string`	Weight function used to decide how much weight to give to the classes of each of the neighboring points when predicting a points class. Can be `uniform` to weight each neighbor's class equally or `distance` to weight each neighbor's class based on its distance to the point.	`uniform`
`algorithm`	`string`	Algorithm used to parse the vector space and decide which points are the nearest neighbors to a given unclassified point. This algorithm can be a `ball_tree` algorithm, `kd_tree` algorithm, `brute` force distance measure approach, or an `auto` choice based on the data.	`auto`
`leafSize`	`int`	If `ball_tree` or `kd_tree` is selected as the `algorithm`, this is the minimum number of points in a given leaf node, after which point, brute force algorithm will be used to find the nearest neighbors. Setting this value either very close to `1` or very close to the total number of points in the data may have a noticeable impact on model runtime. Minimum value is `1`.	`30`
`p`	`int`	Power parameter used when the distance metric `minkowski` is selected. Minimum values is `0`.	`2`
`metric`	`string`	Distance metric used to measure the distance between points. This value can be `minkowski`, `euclidean`, `manhattan`, etc.	`minkowski`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a k-nearest neighbors classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the kNeighborsClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.kNeighborsClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the kNeighborsClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.kNeighborsClassifier[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.kNeighborsRegressor[X;y;prediction]
.qsp.ml.kNeighborsRegressor[X;y;prediction;.qsp.use (!) . flip (
 (`nNeighbors; nNeighbors);
 (`weights ; weights);
 (`metric ; metric);
 (`algorithm ; algorithm);
 (`bufferSize; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`nNeighbors`	`int`	Number of points already labeled or predicted, which lie closest to a given unlabeled point (neighbors), to factor in when predicting a value for the point. Minimum value is `1`.	`5`
`metric`	`string`	The distance metric to be used for the tree. The default metric is `minkowski`, see here for available metrics.	`"minkowski"`
`weights`	`string`	Weight function used to decide how much weight to give to each of the neighboring points when predicting the target of a point. Can be `uniform`, to weight each neighbors target equally, or `distance`, to weight each neighbors target based on their distance to the point.	`"uniform"`
`algorithm`	`string`	Algorithm used to parse the vector space and decide which points are the nearest neighbors. You can choose to use the algorithms `ball_tree`, `kd_tree`, `brute` force distance measure approach, or an `auto` choice based on the data.	`"auto"`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a k-nearest neighbors regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the kNeighborsRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.kNeighborsRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"k-nearest neighborsModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
 | x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the kNeighborsRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.kNeighborsRegressor[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.labelEncode[X]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], dictionary, or ::`	Name of the column(s) in the input table whose labels we want to encode. Can also be a dictionary mapping column names to their expected label values whereby the list of values will be used as the first set of encoding values for each column. If set to `::`, all categorical columns will be encoded as numeric values.	Required

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns the input data with the symbol columns in the data now having been label encoded as numeric values.

This operator encodes symbolic columns within input data as numeric representations. When data is fed into this operator via a stream, the specified symbol columns are encoded and the mapping of each symbol to its respective encoded number is stored as the state. If new symbols appear in subsequent batches, the state will be updated to reflect this.

Examples:

Example 1: Encode all symbol columns within the data.

// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.labelEncode[::]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its columns
publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 2: Encode symbols in column x.

// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.labelEncode[`x]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its columns
publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 3: Encode the symbols in the encoded column with the mapping specified.

// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.labelEncode[(enlist `encoded)!enlist `small`medium`large]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its columns
data: 10?`small`medium`large;
publish ([] original: data; encoded: data);

sp.ml.label_encode({'desk': ['equity', 'fixedIncome', 'foreignExchange'], 'broker': ['A', 'B', 'C']})

Parameters:

name	type	description	default
`columns`	`symbol, symbol[], dictionary, or ::`	String column name of column to be symbol encoded, or a list of strings containing the names of the columns to be encoded, or a dictionary with column names and their expected string/symbol values as the key-value pairs, or `None` to encode all symbolic columns.	Required

Returns:

A label_encode operator, which can be joined to other operators or pipelines.

Examples:

Label encode two columns with labels given by a dictionary:

>>> from kxi import sp
>>> import pykx as kx
>>> import pandas as pd
>>> import numpy as np
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.label_encode({'x': ['a', 'b', 'c'], 'x1': ['d', 'e', 'f']})
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': ['a','a','b','c','c','a','a','b','c','c'],
'x1': ['e','d','e','f','e','d','e','f','f','f'],
'x2': np.random.randn(10)
})
>>> kx.q('publish', data)

x x1 x2
---------------
0 1  -0.1463336
0 0  -0.4371426
1 1  -1.13658
2 2  0.3031224
2 1  0.3391162
0 0  0.22269
0 1  0.2385185
1 2  -0.296696
2 2  -1.972198
2 2  -0.1701961

q Python

.qsp.ml.lasso[X;y;prediction]
.qsp.ml.lasso[X;y;prediction;.qsp.use (!) . flip (
 (`alpha ; alpha);
 (`fitIntercept; fitIntercept);
 (`maxIter ; maxIter);
 (`tol ; tol);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`alpha`	`float`	Constant that controls the regularization strength by multiplying the L1 regularization term. Minimum value is `0.0`.	`1.0`
`fitIntercept`	`boolean`	Whether to add a constant value (intercept) to the regression function - `c` in `y=mx+c`.	`1b`
`maxIter`	`int`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`1000`
`tol`	`float`	Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is `0.0`.	`1e-4`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a lasso regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the lasso operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.lasso[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"LassoModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] 5?1f;5?1f;5?1f;y:0n)
 | x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the lasso operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.lasso[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.linearRegression[X;y;prediction]
.qsp.ml.linearRegression[X;y;prediction;.qsp.use (!) . flip (
 (`trend ; trend);
 (`alpha ; alpha);
 (`maxIter ; maxIter);
 (`gTol ; gTol);
 (`seed ; seed);
 (`penalty ; penalty);
 (`lambda ; lambda);
 (`l1Ratio ; l1Ratio);
 (`decay ; decay);
 (`p ; p);
 (`bufferSize; bufferSize);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; config))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`trend`	`boolean`	Whether to add a constant value (intercept) to the regression function - `c` in `y=mx+c`.	`1b`
`alpha`	`float`	Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is `0.0`.	`0.01`
`maxIter`	`long`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`100`
`gTol`	`float`	Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is `0.0`.	`1e-5`
`seed`	`long`	Integer value used to control the randomness of the model's Initialization state. Specifying this allows for reproducible results across function calls. If set to `::`, the randomness is based off the current timestamp.	`0`
`penalty`	`symbol`	Penalty term used to shrink the coefficients of the less contributive variables. Can be `l1` to add an L1 penalty term, `l2` to add an L2 penalty term, or `elasticNet` to add both L1 and L2 penalty terms.	`l2`
`lambda`	`float`	Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is `0.0`.	`0.001`
`l1Ratio`	`float`	If `elasticNet` is used as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to `0`, this is the same as using L2 regularization, if this value is set to `1`, this is the same as using L1 regularization. This value must lie in the range `[0.0, 1.0]`.	`0.5`
`decay`	`float`	Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is `0.0`.	`0f`
`p`	`float`	Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is `0.0`.	`0f`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

For all common arguments, refer to configuring operators

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. As this is an online model, if subsequent data is passed to the stream, each new collection of data points are used to update the regression model and a prediction will also be made for each record.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with a linear regression model.

// Define and run a stream processor pipeline using the linearRegression operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.linearRegression[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());10000)]
 .qsp.write.toVariable[`output];
// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);
// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);
// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish ([] x:asc 100?1f; y:asc 100?1f);

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the linearRegression operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.linearRegression[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

sp.ml.linear_regression('x', 'y', 'yHat', trend=True, buffer_size=10)

Parameters:

name	type	description	default
`features`	`symbol, symbol[], function, or ::`	Columns to be extracted as features. Either the individual column name as a string, or the list of column names as strings, or a callable function to extract the relevant features.	Required
`labels`	`symbol or function`	Column to be used as label data. String column name, or a callable function.	Required
`prediction`	`symbol, function, or ::`	Function as a string, or callable function used to append predictions to a batch of data.	Required
`udf`		Deprecated, replaced by prediction.

options:

name	type	description	default
`trend`	`boolean`	Boolean indicating if the model has a trend coefficient.	`1b`
`alpha`	`float`	Learning rate applied.	`0.01`
`max_iter`	`long`	Maximum possible number of iterations before the run is terminated. This does not guarantee convergence.	`100`
`g_tol`	`float`	Gradient tolerance, below which the run is terminated.	`1e-5`
`seed`	`long`	Random seed.	`0`
`penalty`	`symbol`	Regularization term as a `Penalty` enum value, or the string equivalent.	`l2`
`lambda_`	`float`	Regularization coefficient.	`0.001`
`l1_ratio`	`float`	Elastic net mixing parameter. This is only used if penalty type `'elasticNet'` is applied.	`0.5`
`decay`	`float`	Decay coefficient.	`0f`
`p`	`float`	Momentum coefficient.	`0f`
`buffer_size`	`long`	Integer value which defines the number of data points which must amass before linear regression is applied.	`0`

Returns:

A pipeline comprised of a linear_regression operator, which can be joined to other pipelines.

Examples:

Fit, predict, and update an online linear regression model on a stream:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.linear_regression('x', 'y', 'yHat', trend=True, buffer_size=10)
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': sorted(np.random.randn(10)),
'y': sorted(np.random.randn(10))
})
>>> kx.q('publish', data)

        x           y           yHat
-----------------------------------
-1.538765   -1.168329   -0.8969647
-0.9830464  -0.9178596  -0.5795978
-0.9159053  -0.2494006  -0.541254
-0.3951384  -0.213047   -0.2438476
-0.05936269 -0.1715577  -0.05208849
0.173539    -0.07597894 0.08092002
0.24296     0.3357135   0.1205659
0.4744972   0.5575508   0.2527951
0.8640336   0.8919448   0.4752567
2.707114    1.210953    1.527827

q Python

.qsp.ml.logClassifier[X;y;prediction]
.qsp.ml.logClassifier[X;y;prediction; .qsp.use (!) . flip (
 (`trend ; trend);
 (`alpha ; alpha);
 (`maxIter ; maxIter);
 (`gTol ; gTol);
 (`seed ; seed);
 (`penalty ; penalty);
 (`lambda ; lambda);
 (`l1Ratio ; l1Ratio);
 (`decay ; decay);
 (`p ; p);
 (`bufferSize; bufferSize);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment)
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`trend`	`boolean`	Whether to add a constant value (intercept) to the classification function - `c` in `y=mx+c`.	`1b`
`alpha`	`float`	Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is `0.0`.	`0.01`
`maxIter`	`long`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`100`
`gTol`	`float`	Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is `0.0`.	`1e-5`
`seed`	`long`	Integer value used to control the randomness of the model's Initialization state. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based off the current timestamp.	`0`
`penalty`	`symbol`	Penalty term used to shrink the coefficients of the less contributive variables. Can be `l1` to add an L1 penalty term, `l2` to add an L2 penalty term, or `elasticNet` to add both L1 and L2 penalty terms.	`l2`
`lambda`	`float`	Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is `0.0`.	`0.001`
`l1Ratio`	`float`	If `Elastic Net` is chosen as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to `0`, this is the same as using L2 regularization, if this value is set to `1`, this is the same as using L1 regularization. This value must lie in the range `[0.0, 1.0]`.	`0.5`
`decay`	`float`	Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum values is `0.0`.	`0f`
`p`	`float`	Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is `0.0`.	`0f`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. As this is an online model, if subsequent data is passed to the stream, each new collection of data points will be used to update the classifier model and a predictions will be made for each record.

Performance Limitations

This functionality is not currently encouraged for use in high throughput environments. Prediction times for this function is on the order of milliseconds. Further optimizations are expected in later releases.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with a logistic classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the logClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.logClassifier[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());1000)]
 .qsp.write.toVariable[`output];
// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish data;
// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish data;
// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish data;

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the logClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.logClassifier[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

sp.ml.log_classifier('x', 'y', 'yHat', trend=True, buffer_size=10)

Parameters:

name	type	description	default
`features`	`symbol, symbol[], function, or ::`	Columns to be extracted as features. Either the individual column name as a string, or the list of column names as strings, or a callable function to extract the relevant features.	Required
`labels`	`symbol or function`	Column to be used as label data. String column name, or a callable function.	Required
`prediction`	`symbol, function, or ::`	A function as a string, or callable function used to append predictions to a batch of data.	Required
udf		Deprecated, replaced by prediction.

options:

name	type	description	default
`trend`	`boolean`	Boolean indicating if the model has a trend coefficient.	`1b`
`alpha`	`float`	Learning rate applied.	`0.01`
`max_iter`	`long`	Maximum possible number of iterations before the run is terminated. This does not guarantee convergence.	`100`
`g_tol`	`float`	Gradient tolerance, below which the run is terminated. seed: Random seed.	`1e-5`
`seed`	`long`	Random seed.	`0`
`penalty`	`symbol`	Regularization term as a `Penalty` enum value, or the string equivalent.	`l2`
`lambda_`	`float`	Regularization coefficient.	`0.001`
`l1_ratio`	`float`	Elastic net mixing parameter. This is only used if penalty type `'elasticNet'` is applied.	`0.5`
`decay`	`float`	Decay coefficient.	`0f`
`p`	`float`	Momentum coefficient.	`0f`
`buffer_size`	`long`	Integer value which defines the number of data points which must amass before linear regression is applied.	`0`

Returns:

A pipeline comprised of a log_classifier operator, which can be joined to other pipelines.

Examples:

Fit, predict, and update an online log classifier model on a stream:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.log_classifier('x', 'y', 'yHat', trend=True, buffer_size=10)
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': sorted(np.random.randn(10)),
'y': 5*[False]+5*[True]
})
>>> kx.q('publish', data)

x          y yHat
-----------------
-1.094345  0 0
-0.9376853 0 0
-0.7055189 0 0
-0.6558396 0 0
-0.4485486 0 0
-0.3953883 1 0
0.03139856 1 1
0.5305628  1 1
0.8246773  1 1
1.471353   1 1

q Python

.qsp.ml.minMaxScaler[X]
.qsp.ml.minMaxScaler[X; .qsp.use (!) . flip (
 (`bufferSize; bufferSize);
 (`rangeError; rangeError))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], dictionary, ::`	Name of the column(s) in the input table whose values we want to scale. Can also be a dictionary mapping column names to the minimum and maximum values to use when scaling. If set to `::`, all numeric columns will be scaled.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before scaling the numeric columns in the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`
`rangeError`	`boolean`	Whether to raise a range error if new input data falls outside the minimum and maximum data range observed during the initialization of the operator.	`0b`

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns the input data with the numeric columns now being scaled so their values lie between `0` and `1`.

This operator scales a set of numeric columns based on a user-supplied data range or based on the minimum and maximum values in the data when the operator is applied. The operator will only be applied, and the minimum/maximum values decided upon, once the number of data point given to the model exceeds the value of the bufferSize parameter. This function can also be configured to error if data supplied after the ranges have been set falls outside this range.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Apply min-max scaling on all data.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.minMaxScaler[::]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to min-max scale its columns
publish ([]20?5;20?5;20?10)

Example 2: Apply min-max scaling on the specified columns.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.minMaxScaler[`x`x1]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to min-max scale its columns
publish ([]20?5;20?5;20?10)

Example 3: Apply min-max scaling on columns rating and cost, with supplied minimum and maximum values for one column and the other based on a buffer.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.minMaxScaler[`rating`cost!(0 10;::); .qsp.use enlist[`bufferSize]!enlist 200]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to min-max scale its columns
publish ([] rating: 3 + 250?5; cost: 250?1000f)

Example 4: Error when passed batches containing data outside the min-max bounds.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.minMaxScaler[::;.qsp.use enlist[`rangeError]!enlist 1b]
 .qsp.write.toConsole[]
// As no buffer is specified, the min and max values are fit using the initial batch
publish ([]100?5;100?5;100?10)
// As `rangeError` has been set, this batch will cause an error by exceeding the
// expected maximum values
publish 1+([]100?5;100?5;100?10)

sp.ml.min_max_scaler('price')

Parameters:

name	type	description	default
`columns`	`symbol, symbol[], dictionary, ::`	Column(s) to be min-max scaled in string format, or a dictionary with column names and `(min, max)` tuples as the key-value pairs to min-max scale several columns, or `None` to min-max scale all columns.	Required

options:

name	type	description	default
`buffer_size`	`long`	Number of data points which must amass before min-max scaling is applied.	`0`
`range_error`	`boolean`	Boolean indicating whether an error should be raised if new data falls outside the min-max range used for scaling.	`0b`

Returns:

A min_max_scaler operator, which can be joined to other operators or pipelines.

Examples:

Performs min-max scaling on a column of a batch of data:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.min_max_scaler('x')
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': np.random.randn(10),
'x1': np.random.randn(10)
})
>>> kx.q('publish', data)

x         x1
---------------------
0.8952703 -0.03508657
1         1.588994
0.3312846 -0.7038333
0         -0.6815334
0.1397654 -0.8542387
0.6615214 0.3446243
0.200455  -0.3855581
0.3410276 -1.461748
0.9060353 1.485059
0.4513415 1.568283

q Python

.qsp.ml.MLPClassifier[X;y;prediction;.qsp.use (!) . flip (
 (`hiddenLayerSizes; hiddenLayerSizes);
 (`activation ; activation);
 (`solver; ; solver);
 (`alpha ; alpha);
 (`batchSize ; batchSize);
 (`learningRate ; learningRate);
 (`learningRateInit; learningRateInit);
 (`powerT ; powerT);
 (`maxIter ; maxIter);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the column which is to house the model's predicted label values for each data record OR a function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`hiddenLayerSizes`	`int[]`	List of the number of neurons in each hidden layer in the neural network. Minimum size of each layer is `1`.	`enlist 100`
`activation`	`string`	Activation function used to transform the output of the hidden layers into a single scalar value. This value can be `identity` to use a linear activation function, `logistic` to use a sigmoid activation function, `tanh` to use a hyperbolic tangent activation function, or `relu` to use a rectified linear unit function.	`relu`
`solver`	`string`	Optimization function used to search for the inputs that minimize/maximize the results of the model function. This value can be `lbfgs` to use a limited-memory BFGS, `sgd` to use stochastic gradient descent, or `adam` to use adaptive moment estimation.	`adam`
`alpha`	`float`	Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss function and is used to reduce the chance of model overfitting. Minimum value is `0.0`.	`0.0001`
`batchSize`	`int`	Number of training examples used in each stochastic optimization iteration. Minimum value is `1`.	`auto`
`learningRate`	`string`	Learning rate schedule for updating the weights of the neural network. Only used when the optimization function is set to `sgd`. This value can be `constant` for a constant learning rate, `optimal` for the optimal learning rate, `invscaling` to use an inverse scaling learning rate, or `adaptive` for an adaptive learning rate.	`constant`
`learningRateInit`	`float`	Starting learning rate value. This controls the step-size used when updating the neural network weights. Not used when the optimization function is set to `lmbfgs`. Minimum value is `0.0`.	`0.001`
`powerT`	`float`	Exponent used to update the learning rate when the learning rate is set to `invscaling` and the optimization function is set to `sgd`.	`0.5`
`maxIter`	`int`	Maximum number of optimization epochs/iterations. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`200`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under 'unnamedExperiments'.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a multi-layer perceptron classifier model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the MLPClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.MLPClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the MLPClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.MLPClassifier[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.oneHot[x]
.qsp.ml.oneHot[x; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], dictionary, or ::`	Name of the column(s) in the input table to one-hot encode. Can also be a dictionary mapping column names to their expected values whereby only columns with these names and values will be encoded. If set to `::`, all categorical columns will be encoded as numeric values.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before one-hot encoding the symbol columns in the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns the input data with the symbol columns in the data now each being represented by multiple numeric columns populated by `0`s and `1`s.

Encodes symbolic and string data as numeric representations. When data is fed into the operator via a stream, the algorithm will only be applied to the data when the number of records received has exceeded the value of the bufferSize parameter. When this happens, the buffered data is one-hot encoded. If subsequent data is passed which contains symbols that were not present at the time of the original fitting, these symbols will be mapped to 0.

Examples:

Example 1: Encode all the symbolic or string columns.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.oneHot[::]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] action: 10?`upload`download; fileType: 10?("image";"audio";"document"); size: 10?100000)

Example 2: Encode column x.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.oneHot[`x]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] x:10?`a`b`c; y:10?1f)

Example 3: Encode columns x and x1 with a required buffer.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.oneHot[`x`x1;.qsp.use ``bufferSize!(`;200)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] 250?`a`b`c; 250?`d`e`f`j; 250?0b)

Example 4: Encode the columns axis and status using given values. This is useful when the categories are known in advance, but may not be present in the training data.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.oneHot[`axis`status!(`x`y`z; `normal`error)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

Example 5: Encode column axis and status using hybrid method

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.oneHot[`axis`status!(::; `normal`error)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

sp.ml.one_hot_encode(['trader','broker'])

Parameters:

name	type	description	default
`columns`	`symbol, symbol[], dictionary, or ::`	Column(s) to be one-hot encoded in string format, or a dictionary with column names and expected symbol/string values as the key-value pairs to be encoded, or `None` to encode all symbolic columns.	Required

options:

name	type	description	default
`buffer_size`	`long`	The number of data points which must amass before symbolic data is encoded.	`0`

Returns:

A one_hot_encode operator, which can be joined to other operators or pipelines.

Examples:

Performs one-hot encoding on a batch of data:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.one_hot_encode()
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': ['a','a','b','c','c','a','b','c','a','b'],
'x1': ['a','b','b','c','b','a','b','c','b','b'],
'x2': np.random.randn(10),
'x3': 5*["abc","def"]
})
>>> kx.q('publish', data)

x2         x_a x_b x_c x1_a x1_b x1_c x3_abc x3_def
---------------------------------------------------
0.9442428  1   0   0   1    0    0    1      0
-0.2004368 1   0   0   0    1    0    0      1
-0.3597777 0   1   0   0    1    0    1      0
-1.408116  0   0   1   0    0    1    0      1
-0.2833716 0   0   1   0    1    0    1      0
1.072297   1   0   0   1    0    0    0      1
0.01474962 0   1   0   0    1    0    1      0
0.4720782  0   0   1   0    0    1    0      1
-1.032484  1   0   0   0    1    0    1      0
-0.3735379 0   1   0   0    1    0    0      1

q Python

.qsp.ml.registry.predict[X;prediction];
.qsp.ml.registry.predict[X;prediction; .qsp.use (!) . flip (
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`version ; version))]

Parameters:

name	type	description	default
`X`	`symbol[], ::, or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, all non-categorical columns are used.	Required
`prediction`	`symbol, ::, or function`	Can be the name of the column which is to house the model's predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the symbol `prediction` is used.	Required

options:

name	type	description	default
`model`	`string`	Name of the fitted model we want to load from the registry. If set to `::`, the most recently uploaded model will be loaded.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be loaded from. This can be a local path or a cloud storage path. If set to `::`, the local registry in the present working directory will be used.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model we want to load is stored under. If set to `::`, the model will be loaded from `unnamedExperiments`.	`::`
`version`	`float`	Version of the fitted model we want to load from the registry. If set to `::`, the latest version of the model will be loaded.	`::`

For all common arguments, refer to configuring operators

Returns:

type	description
`any`	Returns the input data with an additional column containing the model's predicted label values for each data point.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

.qsp.ml.registry.predict will predict the target value for each record in the batch, using a model from the registry.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Get predictions from an sklearn model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:desc n?10;x2:n?1f;y:asc n?5);
// Define and fit an sklearn model
features:flip value flip delete y from data;
clf1:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf1:clf1[`max_depth pykw 3];
clf1[`:fit][features;data`y];
// Set the model within the existing registry
.ml.registry.set.model[::;::;clf1;"skModel";"sklearn";::];
// Define and run a stream processor pipeline using the predict model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.predict[
 `x`x1`x2
 `yhat;
 .qsp.use enlist[`model]!enlist["skModel"]
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to get predctions from the model
publish data;

Example 2: Get predictions from a q model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and fit a q model
features:data`x`x1`x2;
kmeansModel:.ml.clust.kmeans.fit[features;`e2dist;6;enlist[`iter]!enlist[1000]];
// Set the model within the existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::];
// Define and run a stream processor pipeline using the predict model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.predict[
 `x`x1`x2;
 `yhat;
 .qsp.use enlist[`model]!enlist["kmeansModel"]
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor get predictions from the model
publish data;

Example 3: Get predictions from a q model by passing functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions to be passed to the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
 update newClust: clusters from data
 };
// Define and run a stream processor pipeline using the predict model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.predict[
 xFunc;
 clustFunc;
 .qsp.use enlist[`model]!enlist["kmeansModel"]
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data

sp.ml.predict(
['x','x1','x2'],
'yhat',
registry='/tmp',
model='skLC'
)

Parameters:

name	type	description	default
`features`	`symbol[], ::, or function`	Column name of the predictor variable, or the function required to generate predictors from a batch.	Required
`prediction`	`symbol, ::, or function`	Function for integrating the predictions into the batch, or a column name to join predictions to the table.	Required
`udf`		Deprecated, replaced by prediction.

options:

name	type	description	default
`model`	`string`	Model within the registry to use for prediction.	`::`
`registry`	`string`	Registry to load models from.	`::`
`experiment`	`string`	Experiment within the registry to load models from.	`::`
`version`	`float`	Model version to load in for prediction.	`::`

Returns:

A predict operator, which can be joined to other operators or pipelines.

Examples:

Save a model to registry then retrieve it to serve predictions:

>>> from kxi import sp, ml
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> import sklearn.linear_model as sk
>>> data = pd.DataFrame({
'x': np.random.randn(500),
'x1': sorted(np.random.randn(500)),
'x2': np.random.randn(500),
'y': 250*[False]+250*[True]
})
>>> X=data[['x','x1','x2']].to_numpy()
>>> y=data['y'].to_numpy()
>>> model=sk.LogisticRegression().fit(X,y)
>>> ml.registry.set.model(model,"skLC",'sklearn',folder_path="/tmp")
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.predict(['x','x1','x2'],
'yhat',
registry='/tmp',
model='skLC')
| sp.write.to_console(timestamp='none'))
>>> kx.q('publish', data[['x','x1','x2']].iloc[:10])

x           x1        x2          yhat
--------------------------------------
0.346299    -2.515892 0.2226648   0
0.4090534   -2.313049 -0.08848148 0
-1.102503   -2.303455 -0.4292562  0
0.05339919  -2.260012 0.02841283  0
-0.3747164  -2.257401 0.7677529   0
-0.08124184 -2.200042 -1.418281   0
1.393183    -2.198983 0.954484    0
0.3279165   -2.17734  0.08885551  0
0.1826349   -2.171023 -0.7898705  0
-2.060252   -2.113853 1.085127    0

q Python

.qsp.ml.quadraticDiscriminantAnalysis[X;y;prediction;.qsp.use (!) . flip (
 (`priors ; priors);
 (`bufferSize; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function of the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`priors`	`float[]`	List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is `0.0`. If set to `::`, the priors will be adjusted according to the data.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a quadratic discriminant analysis model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the quadraticDiscriminantAnalysis operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.quadraticDiscriminantAnalysis[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the quadraticDiscriminantAnalysis operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.quadraticDiscriminantAnalysis[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.randomForestClassifier[X;y;prediction.;.qsp.use (!) . flip (
 (`nEstimators ; nEstimators);
 (`criterion ; criterion);
 (`maxDepth ; maxDepth);
 (`minSamplesSplit ; minSamplesSplit);
 (`minSamplesLeaf ; minSamplesLeaf);
 (`minWeightFractionLeaf; minWeightFractionLeaf);
 (`maxFeatures ; maxFeatures);
 (`maxLeafNodes ; maxLeafNodes);
 (`minImpurityDecrease ; minImpurityDecrease);
 (`bootstrap ; bootstrap)
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`nEstimators`	`int`	Maximum number of decision tree estimators to train and use. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is `1`.	`100`
`criterion`	`string`	Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be `gini` to use the Gini impurity measure or `entropy` to use the information gain measure.	`gini`
`maxDepth`	`int`	Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain less than the `minSamplesSplit` value. Minimum value is `1`.	`::`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`minWeightFractionLeaf`	`float`	Minimum proportion of sample weight required to be at any leaf node relative to the total weight of all samples in the tree. When the `sample_weight` argument is not set using the `modelInit` parameter, each sample carries equal weight. This value must lie in the range `[0.0, 1.0]`.	`0.0`
`maxFeatures`	`string`	Maximum number of features to consider when looking for the best way to split a node. This value can be `sqrt` for the square root of all features, `log2` for log to the base 2 of all features, or `auto` to automatically select the number of features to consider.	`auto`
`maxLeafNodes`	`int`	Maximum number of leaf nodes in each decision tree. This forces the tree to grow in a best-first fashion with the best nodes based on their relative reduction in impurity. If set to `::`, there may be unlimited leaf nodes. Minimum value is `1`.	`::`
`minImpurityDecrease`	`float`	Minimum impurity decrease value required to split a node. If the tree impurity would not decrease by more than this value, the node will not be split. Minimum value is `0.0`.	`0.0`
`bootstrap`	`boolean`	Whether bootstrap samples are used when building trees. If `1b`, the whole dataset is used to build each tree.	`1b`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a random forest classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the randomForestClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.randomForestClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the randomForestClassifier operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.randomForestClassifier[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.randomForestRegressor[X;y;prediction]
.qsp.ml.randomForestRegressor[X;y;prediction;.qsp.use (!) . flip (
 (`nEstimators ; nEstimators);
 (`criterion ; criterion);
 (`minSamplesSplit; minSamplesSplit);
 (`minSamplesLeaf ; minSamplesLeaf);
 (`maxDepth ; maxDepth);
 (`bufferSize ; bufferSize);
 (`modelInit ; modelInit);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`config ; registryConfig)]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to `::` and the models `y` parameter is a function, an error will occur.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, function, or ::`	Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the default symbol `prediction` is used.	Required

options:

name	type	description	default
`nEstimators`	`int`	Maximum number of decision tree estimators to train and use. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is `1`.	`100`
`criterion`	`string`	Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be `squared_error` for mean squared error, `friedman_mse` for mean squared error with Friedman's improvement score, `absolute_error` for mean absolute error, or `poisson` for Poisson deviance.	`"squared_error"`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`maxDepth`	`int`	Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain less than the `Minimum Samples To Split Node` value. Minimum value is `1`.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`dict`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a random forest regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the randomForestRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.randomForestRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"RafrModel")]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
 | x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the randomForestRegressor operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.randomForestRegressor[xFunc;yFunc;predFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

Coming Soon

q Python

.qsp.ml.score[y;predictions;metric]

Parameters:

name	type	description	default
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`predictions`	`symbol or function`	Can be the name of the column which houses the model's predictions OR a user-defined function that will generate predictions from the input data.	Required
`metric`	`symbol`	Metric to use to compare the predictions with the `y` target values.	Required

For all common arguments, refer to configuring operators

Returns:

type	description
`any`	The evaluation score given by the metric.

Score the performance of a model over time allowing changes in model performance to be evaluated. The values returned are the cumulative scores, rather than scores for the individual batches.

The following metrics are currently supported:

f1
accuracy
mse
rmse

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fits a scikit-learn model, predict y, and calculate the cumulative F1 score of the model on receipt of new data.

// Retrieve a predefined dataset and format it appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;
// Split the data into a training and testing set
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;
// Train the model
features:flip value flip delete y from training;
targets :training`y;
clf:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf:clf[`max_depth pykw 3];
clf[`:fit][features;targets];
// Set model within the existing registry
.ml.registry.set.model[::;::;clf;"skModel";"sklearn";::];
// Define and run a stream processor pipeline using the score operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.predict[
 {delete y from x};
 `pred;
 .qsp.use enlist[`model]!enlist"skModel"]
 .qsp.ml.score[`y; `pred; `f1]
 .qsp.write.toConsole[];
// Pass the test data to the stream processor to evaluate the predictive performance of the model
publish testing;

Example 2: Fit a q model, predict y, and score the cumulative accuracy on receipt of new data.

// Retrieve a predefined dataset and format it appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;
// Split the data into training and testing sets
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;
// Train the model
features:flip value flip delete y from training;
targets:training`y;
model:.ml.online.sgd.logClassifier.fit[features;targets;1b;::];
// Set model within the existing registry
.ml.registry.set.model[::;::;model;"myModel";"q";::]
// Define and run a stream processor pipeline using the score operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.predict[
 {delete y from x};
 `pred;
 .qsp.use enlist[`model]!enlist"myModel"]
 .qsp.ml.score[`y; `pred; `accuracy]
 .qsp.write.toConsole[]
// Pass the test data to the stream processor to evaluate the predictive performance of the model
publish testing

sp.ml.score('y', 'yhat', 'mse')

Parameters:

name	type	description	default
`y_true`	`symbol or function`	Column containing the true values as a string, or a callable function to retrieve the true values from the batch.	Required
`y_pred`	`symbol or function`	Column containing the predicted values as a string, or a callable function to retrieve the predicted values from the batch.	Required
`metric`	`symbol`	A `Metric` enum value, or the string equivalent.	Required

Returns:

A score operator, which can be joined to other operators or pipelines.

Examples:

Computes the mean squared error between true and predicted results:

>>> from kxi import sp, ml
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> import sklearn.linear_model as sk
>>> data = pd.DataFrame({
'x' :np.random.randn(500),
'x1':sorted(np.random.randn(500)),
'x2':np.random.randn(500),
'y' :250*[False] + 250*[True]
})
>>> training = data.iloc[:450].reset_index(drop=True)
>>> testing = data.iloc[450:].reset_index(drop=True)
>>> X = training[['x','x1','x2']].to_numpy()
>>> y = training['y'].to_numpy()
>>> model = sk.LogisticRegression().fit(X,y)
>>> ml.registry.set.model(model,"skLC",'sklearn',folder_path="/tmp")
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.predict(['x', 'x1', 'x2'], 'yhat', registry='/tmp', model='skLC')
| sp.ml.score('y', 'yhat', 'mse')
| sp.write.to_console(timestamp='none'))
>>> kx.q('publish', testing)

0f

q Python

.qsp.ml.sequentialKMeans[X;cluster]
.qsp.ml.sequentialKMeans[X;cluster; .qsp.use (!) . flip (
 (`df ; df);
 (`k ; k);
 (`centers ; centers);
 (`init ; init);
 (`alpha ; alpha);
 (`forgetful ; forgetful);
 (`bufferSize; bufferSize);
 (`model ; model);
 (`registry ; registry);
 (`experiment; experiment);
 (`config ; config))]

Parameters:

name	type	description	default
`X`	`symbol, symbol[], function, or ::`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, all non-categorical columns are extracted.	Required
`cluster`	`symbol or ::`	Name of the column which is to house the model's predicted class labels. If set to `::`, the default symbol `cluster` is used.	Required

options:

name	type	description	default
`df`	`symbol`	Distance function used to measure the distance between points when clustering. This can be `edist` for Euclidean distance, `e2dist` for squared Euclidean distance, `nege2dist` for negative squared Euclidean distance, `mdist` for Manhattan distance, or `cshev` for Chebyshev distance.	`edist`
`k`	`long`	Final number of clusters to be defined by the model. Minimum value is `2`.	`3`
`centers`	`dictionary or ::`	A dictionary mapping each cluster to the cluster centroid value that we want these clusters to initialize with.	`::`
`init`	`bool`	Initialization method for the cluster centroids. This value can either be K-means++ (`1b`) or randomized initialization (`0b`).	`1b`
`alpha`	`float`	Controls the rate at which the concept of forgetfulness is applied within the algorithm. If forgetful Sequential K-Means is applied, this value defines how much past cluster centroid information is retained, if not, this is set to `1/(n+1)` where `n` is the number of points in a given cluster. This value must lie in the range `[0.0, 1.0]`.	`0.1`
`forgetful`	`bool`	Whether to apply forgetful Sequential K-Means (`1b`) or normal Sequential K-Means (`0b`). Forgetful Sequential K-Means will allow the model to evolve its cluster boundaries over time by forgetting about old data as new data comes in.	`1b`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`config`	`any`	Configuration used for fitting the model.	`()!()`

For all common arguments, refer to configuring operators

Returns:

type	description
`table or ::`	Null during initial fitting. Afterwards returns the input data with an additional column containing the model's predicted cluster labels.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or cluster model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the cluster parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`clusters`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

The sequential K-Means algorithm is applied within a streaming framework. When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. As this is an online model, if subsequent data is passed to the stream, each new collection of data points are used to update the current cluster centers and predictions are made as to which cluster each point belongs.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with the sequential K-Means model.

// Define and run a stream processor pipeline using the sequentialKMeans operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.sequentialKMeans[`x`x1`x2; `cluster; .qsp.use enlist[`bufferSize]!enlist 100]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish ([]100?1f;100?1f;100?1f);
// Now that the bufferSize has been reached, we can retrieve predictions using this fit model by passing new data
publish ([] 50?1f; 50?1f; 50?1f);

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
 update newClust: clusters from data
 };
// Define and run a stream processor pipeline using the sequentialKMeans operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.sequentialKMeans[xFunc;clustFunc]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;

sp.ml.sequential_k_means(['x', 'x1', 'x2'], 'myClusters', buffer_size=10)

Parameters:

name	type	description	default
`columns`	`symbol, symbol[], function, or ::`	Columns to extract from data for clustering.	Required
`cluster`	`symbol or ::`	Column name for integrating the predictions into the batch.	Required
`udf`		Deprecated, replaced by cluster.

options:

name	type	description	default
`distance_function`	`symbol`	Similarity measure used for clustering points; a `Distance` enum value, or the string equivalent.	`edist`
`centroids`	`long`	Number of cluster centers.	`3`
`init_centers`	`dictionary or ::`	Centers used for initialization of algorithm.	`::`
`init`	`bool`	Boolean indicating how to initialize cluster centroids - K-means++ (`True`) or randomized (`False`).	`1b`
`alpha`	`float`	Learning rate value between 0-1 used to define how much past centroid information to retain.	`0.1`
`forgetful`	`bool`	Apply forgetful (`True`) or normal sequential K-Means (`False`).	`1b`
`buffer_size`	`long`	Integer value defining the number of data points which must amass before initial clustering is applied.	`0`

Returns:

A pipeline comprised of a sequential_k_means operator, which can be joined to other pipelines.

Examples:

Discover clusters in a batch of data:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.sequential_k_means(['x', 'x1', 'x2'], 'myClusters', buffer_size=10)
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': np.random.randn(10),
'x1': np.random.randn(10),
'x2': np.random.randn(10)
})
>>> kx.q('publish', data)

x           x1         x2          myClusters
---------------------------------------------
-0.03652799 -0.120031  -2.991009   1
-1.290761   0.1046565  0.007084028 0
1.005217    -0.8182222 -0.9399138  0
0.4791692   0.1235131  -0.6635202  0
-0.1670914  -0.2006089 0.5007579   0
-0.3650267  -1.105404  0.988489    2
-0.2914796  0.3655497  2.741289    2
-0.5778521  -1.107383  -1.215941   0
1.126198    -0.3380897 1.903199    2
-0.2179497  0.6098755  0.2427696   0

q Python

.qsp.ml.standardize[X]
.qsp.ml.standardize[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or ::`	Name of the column(s) in the input table to standardize. If set to `::`, all numeric columns will be standardized.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before standardizing the numerical columns in the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`

For all common arguments, refer to configuring operators

Returns:

type	description
`table`	Returns the input data with the numeric columns now having a mean value of `0` and a standard deviation of `1`.

Standardize a user-specified set of columns in an input table. When data is fed into this operator via a stream, the algorithm will only scale the data when the number of records received has exceeded the value of the bufferSize parameter. Once this happens, the mean and standard deviation of each column is computed. These statistics are then used on subsequent batches which are normalized by subtracting this mean value and dividing the result by the standard deviation value.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Applies standardization to all data.

// Define and run a stream processor pipeline using the standardize operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.standardize[::]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to standardize its columns
publish ([]100?5;100?5;100?10)

Example 2: Apply standardization to specified columns.

// Define and run a stream processor pipeline using the standardize operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.standardize[`x`x1]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to standardize its columns
publish ([]100?5;100?5;100?10)

Example 3: This pipeline applies standardization to all columns based on a buffer.

// Define and run a stream processor pipeline using the standardize operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.standardize[::; .qsp.use enlist[`bufferSize]!enlist 200]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to standardize its columns
publish ([] length: 100 + 250?2f; width: 10 + 250?1f);

sp.ml.standardize()

Parameters:

name	type	description	default
`columns`	`symbol or symbol[] or ::`	Either column(s) to be standard scaled in string format, or `None` standard scale all columns.	Required

options:

name	type	description	default
`buffer_size`	`long`	The number of data points which must amass before standard scaling is applied.	`0`

Returns:

A standardize operator, which can be joined to other operators or pipelines.

Examples:

Apply standard scaling to a batch of data:

>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.standardize()
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': 10*np.random.randn(10),
'x1': 100*np.random.randn(10),
'x2': 20*np.random.randn(10)
})
>>> kx.q('publish', data)

x           x1         x2
----------------------------------
1.480517    -0.6734071 -0.9577657
0.2441469   0.2270268  1.673357
-0.06926182 -1.072527  0.03043792
0.8506082   -0.6297122 -0.8603573
0.9383493   2.502053   0.2302604
0.2230779   -0.1797943 -0.04656411
-1.520206   0.6135481  -1.625691
0.3302589   -1.047889  -0.6386158
-0.6906624  0.401498   0.982027
-1.786828   -0.1407956 1.212912

q Python

.qsp.ml.registry.update[X;y;prediction]
.qsp.ml.registry.update[X;y;prediction; .qsp.use (!) . flip (
 (`untrained ; untrained);
 (`modelArgs ; modelArgs);
 (`model ; model);
 (`registry ; registry);
 (`experiment ; experiment);
 (`version ; version);
 (`config ; config))]

Parameters:

name	type	description	default
`X`	`symbol[], ::, or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to `::`, all non-categorical and non-target columns are used.	Required
`y`	`symbol or function`	Can be the name of the column containing the data’s target labels OR a user-defined function that returns the target values to use.	Required
`prediction`	`symbol, ::, or function`	Can be the name of the column which is to house the model’s predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to `::`, the symbol `prediction` is used.	Required

options:

name	type	description	default
`untrained`	`function or embedpy`	An untrained q model that we want to update. If set to `()`, the registry related parameters will be used to load a model from the registry to be updated.	`()`
`bufferSize`	`long`	Number of records to observe before updating the model. If set to `0`, the model will be updated on the first batch. Minimum value is `0`.	`0`
`modelArgs`	`list`	List of argument passed to the model to help configure the updating process.	`::`
`model`	`string`	Name of the model to be loaded from/stored in the registry. If no value is supplied, the model will not be loaded from/stored in the registry.	`::`
`registry`	`string`	Location of the registry where the model to be loaded is stored/the updated model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory for the model to be stored.	`::`
`experiment`	`string`	Name of the experiment in the registry that the model is to be loaded from/the updated model is to be stored under. If no value is supplied, the model will be loaded from/stored under `unnamedExperiments`.	`::`
`version`	`long[]`	Version of the model we want to load from the registry. If set to `::`, the latest version of the model will be loaded.	`::`
`config`	`any`	Dictionary used to configure additional settings when saving the model to the registry.	`()!()`

For all common arguments, refer to configuring operators

Returns:

type	description
`any`	The current batch, modified in accordance with the `udf` parameter.

Note

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter takes four arguments:

name	type	description
`data`	`any`	Batch passed to the operator, only the data not the metadata.
`y`	`symbol, function, or ::`	Target variable supplied to the model as the `y` parameter.
`predictions`	`list`	Model's predictions for each record in the batch.
`modelInfo`	`::`	Information about the model. Currently not used and always set to `::`.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

Train a model incrementally returning predictions for each record in a batch.

Python support

Currently this functionality is only supported for q models. Support for deployment of online learning models written in Python is scheduled for a later release.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit an untrained q model which can be updated.

// Generate initial data to be used for fitting
a:500?1f;
b:500?1f;
data:([]x:a;x1:b;y:a+b);
// Define and run a stream processor pipeline using the update model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.update[
 `x`x1;
 `y;
 `yhat;
 .qsp.use enlist[`untrained]!enlist[.ml.online.sgd.linearRegression]
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to update the model
publish data;

Example 2: Fit an untrained model by passing functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions to be passed to the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
 delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
 };
predFunc: {[data;y;predictions;modelInfo]
 update newPred: predictions from data
 };
// Define and run a stream processor pipeline using the update model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.update[
 xFunc;
 yFunc;
 predFunc;
 .qsp.use enlist[`untrained]!enlist[.ml.online.sgd.linearRegression]
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data

Example 3: Update a q model from the registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Generate data to be used for updating
n:100000;
data2:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and fit a q model
features:data`x`x1`x2;
kmeansModel:.ml.clust.kmeans.fit[features;`e2dist;6;enlist[`iter]!enlist[1000]];
// Set the model within the existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::];
// Define optional model fitting parameters
optKeys:`model`registry`experiment;
optVals:("kmeansModel";::;::);
opt:optKeys!optVals;
// Define and run a stream processor pipeline using the update model operator
.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.ml.registry.update[
 `x`x1;
 `y;
 `yhat;
 .qsp.use opt
 ]
 .qsp.write.toConsole[];
// Pass a batch of data to the stream processor to update the model
publish data2;
// Call the get model store function to show the original and updated models have been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

Coming Soon

Machine Learning

AdaBoost Classifier

AdaBoost Regressor

Affinity Propagation Clustering

Birch Clustering

Clustering Using Representatives

DBSCAN Clustering

Decision Tree Classifier

Decision Tree Regressor

Drop Constant Columns

Feature Hasher

Fit Registry Model

FRESH Feature Creation

Gaussian Naïve Bayes Classifier

Gradient Boosting Regressor

K-Nearest Neighbours Classifier

K-Nearest Neighbours Regressor

Label Encode Categorical Columns

Lasso Regressor

Linear Regressor

Logistic Classifier

Min-Max Scale Numerical Columns

Multi-Layer Perceptron Classifier

One-Hot Encode Categorical Columns

Predict Using Registry Model

Quadratic Discriminant Analysis Classifier

Random Forest Classifier

Random Forest Regressor

Score Model

Sequential K-Means Clustering

Standardize Numerical Columns

Update Registry Model