Machine Learning
This page details the Machine Learning APIs available to use in the Stream Processor, including Classification, Clustering, Regression, Feature Creation, Metrics, Custom (Registry) Models, and general ML utility functions.
Note
Keyed tables
All ML operators act solely on unkeyed tables (type 98).
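If a source emits keyed tables, unkey them with a map step before any ML operator. A minimal sketch, assuming a callback reader named `publish` (the reader and writer here are illustrative):
// 0! unkeys a keyed table (type 99 -> type 98) before it reaches an ML operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.map[0!]
  .qsp.write.toConsole[];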
AdaBoost Classifier
Applies an AdaBoost Classifier model to the data.
.qsp.ml.adaBoostClassifier[X;y;prediction]
.qsp.ml.adaBoostClassifier[X;y;prediction;.qsp.use (!) . flip (
(`nEstimators ; nEstimators);
(`learningRate; learningRate);
(`algorithm ; algorithm);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`y` | | Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. | Required
`prediction` | | Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`nEstimators` | | Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is … |
`learningRate` | | Controls the loss function used to set the weight of each classifier at each boosting iteration. The higher this value, the more each classifier will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is … |
`algorithm` | | Multi-class AdaBoost function used to extend the AdaBoost operator to have multi-class capabilities. This value can be … |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`modelInit` | | A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Dictionary used to configure additional settings when saving the model to the registry. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted class labels.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X`, `y`, or `prediction` model parameters. These functions have two different forms depending on whether they are values for the `X` and `y` parameters or for the `prediction` parameter.
Functions for the `X` and `y` model parameters take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `prediction` model parameter take four arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`y` | | Target variable supplied to the model as the …
`predictions` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
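For example, `exec` returns a list where `select` returns a single-column table; both shapes are usable here. An illustrative pair of equivalent `y` extractors:
// two ways to pull the target out of a batch
yList: {[data] exec y from data};   / list
yTab: {[data] select y from data};  / single-column table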
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). This ensures that a minimum of `n` samples are used to train the model. If subsequent data is passed to the stream, the operator outputs predictions for each sample using the model fitted on the first `n` samples.
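As an illustration of this buffering (the threshold and column names below are arbitrary choices, not defaults):
// bufferSize 100: nothing is fit until at least 100 records have arrived
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostClassifier[`x;`y;`yHat; .qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];
publish ([] x:60?1f; y:60?0b);  // 60 records: buffered, model not yet fit
publish ([] x:60?1f; y:60?0b);  // 120 records: model is fit
publish ([] x:5?1f; y:5?0b);    // scored with the already-fitted model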
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, Update and Predict with an adaBoost classification model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the adaBoostClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.adaBoostClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the adaBoostClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.adaBoostClassifier[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
AdaBoost Regressor
Applies an AdaBoost Regressor model to the data.
.qsp.ml.adaBoostRegressor[X;y;prediction]
.qsp.ml.adaBoostRegressor[X;y;prediction;.qsp.use (!) . flip (
(`nEstimators ; nEstimators);
(`learningRate; learningRate);
(`loss ; loss);
(`modelInit ; modelInit);
(`bufferSize ; bufferSize);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`y` | | Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. | Required
`prediction` | | Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`nEstimators` | | Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is … |
`learningRate` | | Weight applied to each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. This value must lie in the range … |
`loss` | | Loss function used to update the contributing weights of the regressors after each boosting iteration. This can be … |
`modelInit` | | A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Dictionary used to configure additional settings when saving the model to the registry. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted target values.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X`, `y`, or `prediction` model parameters. These functions have two different forms depending on whether they are values for the `X` and `y` parameters or for the `prediction` parameter.
Functions for the `X` and `y` model parameters take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `prediction` model parameter take four arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`y` | | Target variable supplied to the model as the …
`predictions` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). When training, all data in the batch which causes the buffered data to exceed `n` elements is included in fitting. If subsequent data is passed to the stream, the operator outputs predictions for each sample using the model fitted on the first `n` samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit an adaBoost regression model on data and store model in local registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the adaBoostRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.adaBoostRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"AdaModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] x:5?1f; x1:5?1f; x2:5?1f; y:0n)
| x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the adaBoostRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.adaBoostRegressor[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Affinity Propagation Clustering
Applies an Affinity Propagation Clustering Algorithm to the data.
.qsp.ml.affinityPropagation[X;cluster;.qsp.use (!) . flip (
(`damping ; damping);
(`maxIter ; maxIter);
(`convergenceIter; convergenceIter);
(`affinity ; affinity);
(`randomState ; randomState);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`cluster` | | Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`damping` | | Provides numerical stabilization and limits oscillations and “overshooting” of parameters by controlling the extent to which the current value is maintained relative to incoming values. This value must lie in the range … |
`maxIter` | | Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is … |
`convergenceIter` | | Number of iterations, during which there is no change in the number of estimated clusters, needed to stop the convergence. Minimum value is … |
`affinity` | | Statistical measure used to define similarities between the representative points. This value can be … |
`randomState` | | Integer value used to control the state of the random generator used in this model. Specifying this allows for reproducible results across function calls. If set to … |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`modelInit` | | A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Dictionary used to configure additional settings when saving the model to the registry. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted cluster labels.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X` or `cluster` model parameters. These functions have two different forms depending on whether they are values for the `X` parameter or for the `cluster` parameter.
Functions for the `X` model parameter take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `cluster` model parameter take three arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`clusters` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;clusters;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). When training, all data in the batch which causes the buffered data to exceed `n` elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table with the cluster labels added.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit an affinityPropagation clustering model storing the result in a registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the affinityPropagation operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.affinityPropagation[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"AffinityPropagationModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] x:5?1f; x1:5?1f; x2:5?1f)
| x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
clustFunc: {[data;clusters;modelInfo]
update newClust: clusters from data
};
// Define and run a stream processor pipeline using the affinityPropagation operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.affinityPropagation[xFunc;clustFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Birch Clustering
Applies a Birch Clustering Algorithm to the data.
.qsp.ml.birch[X;cluster;.qsp.use (!) . flip (
(`threshold ; threshold);
(`branchingFactor; branchingFactor);
(`nClusters ; nClusters);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; config))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`cluster` | | Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`threshold` | | Maximum cluster radius allowed for a new sample to be merged into its closest subcluster. If adding this point to the cluster would cause that cluster's radius to exceed this maximum, the new point is not added and instead becomes a new subcluster. Minimum value is … |
`branchingFactor` | | Maximum number of subclusters in each node in the tree, where each leaf node contains a subcluster. If a new sample arrives causing the number of subclusters to exceed this value for a given node, the node is split into two nodes. Minimum value is … |
`nClusters` | | Final number of clusters to be defined by the model. Minimum value is … |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`modelInit` | | A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Configuration for fitting the model. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted cluster labels.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X` or `cluster` model parameters. These functions have two different forms depending on whether they are values for the `X` parameter or for the `cluster` parameter.
Functions for the `X` model parameter take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `cluster` model parameter take three arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`clusters` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;clusters;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). When training, all data in the batch which causes the buffered data to exceed `n` elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table with the cluster labels added.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a Birch clustering model storing the result in a registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the birch operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.birch[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"BirchModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] x:5?1f; x1:5?1f; x2:5?1f)
| x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
clustFunc: {[data;clusters;modelInfo]
update newClust: clusters from data
};
// Define and run a stream processor pipeline using the birch operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.birch[xFunc;clustFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Clustering Using Representatives
Applies a CURE Clustering Algorithm to the data.
.qsp.ml.cure[X;cluster;.qsp.use (!) . flip (
(`df ; df);
(`n ; n);
(`c ; c);
(`cutDict ; cutDict);
(`bufferSize; bufferSize);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; registryConfig))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`cluster` | | Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`df` | | Distance function used to measure the distance between points when clustering. This can be … |
`n` | | Number of representative points to choose from each cluster to compare the similarity of clusters for the purposes of potentially merging them. Minimum value is … |
`c` | | Compression factor used for grouping the representative points together. Minimum value is … |
`k` | | Final number of clusters to be defined by the model. Minimum value is … |
`dist` | | Distance between leaves at which to cut the dendrogram to define the clusters. Minimum value is … |
`cutDict` | | A dictionary that defines the cutting algorithm used when splitting the data into clusters. This can be used to define a … |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Dictionary used to configure additional settings when saving the model to the registry. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted cluster labels.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X` or `cluster` model parameters. These functions have two different forms depending on whether they are values for the `X` parameter or for the `cluster` parameter.
Functions for the `X` model parameter take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `cluster` model parameter take three arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`clusters` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;clusters;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). When training, all data in the batch which causes the buffered data to exceed `n` elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table with the cluster labels added.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a cure clustering model storing the result in a registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the cure operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.cure[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"cureModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] x:5?1f; x1:5?1f; x2:5?1f)
| x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
clustFunc: {[data;clusters;modelInfo]
update newClust: clusters from data
};
// Define and run a stream processor pipeline using the cure operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.cure[xFunc;clustFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
DBSCAN Clustering
Applies a DBSCAN Clustering Algorithm to the data.
.qsp.ml.dbscan[X;cluster;.qsp.use (!) . flip (
(`df ; df);
(`minPts ; minPts);
(`eps ; eps);
(`bufferSize; bufferSize);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; config))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`cluster` | | Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`df` | | Distance function used to measure the distance between points when clustering. This can be … |
`minPts` | | Minimum number of points required to be close together before this group of points is defined as a cluster. The maximum distance these points are to be away from one another must be less than or equal to the … |
`eps` | | Maximum distance points are allowed to be away from one another to still be classed as close enough to be in the same cluster. Minimum value is … |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Configuration for fitting the model. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted cluster labels.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X` or `cluster` model parameters. These functions have two different forms depending on whether they are values for the `X` parameter or for the `cluster` parameter.
Functions for the `X` model parameter take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `cluster` model parameter take three arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`clusters` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;clusters;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). When training, all data in the batch which causes the buffered data to exceed `n` elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table with the cluster labels added.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a dbscan clustering model storing the result in a registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and run a stream processor pipeline using the dbscan operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.dbscan[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"dbscanModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] x:5?1f; x1:5?1f; x2:5?1f)
| x x1 x2 cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473 0.7141816 0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309 0.6165058 0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675 0.514663 1
2022.03.01D09:26:44.376050100| 0.7639509 0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209 0.59064 0.6708373 1
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
clustFunc: {[data;clusters;modelInfo]
update newClust: clusters from data
};
// Define and run a stream processor pipeline using the dbscan operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.dbscan[xFunc;clustFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Decision Tree Classifier
Applies a Decision Tree Classifier model to the data.
.qsp.ml.decisionTreeClassifier[X;y;prediction;.qsp.use (!) . flip (
(`criterion ; criterion);
(`splitter ; splitter);
(`maxDepth ; maxDepth);
(`minSamplesSplit; minSamplesSplit);
(`minSamplesLeaf ; minSamplesLeaf);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`y` | | Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. | Required
`prediction` | | Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`criterion` | | Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be … |
`splitter` | | Strategy used to split the nodes in the tree. This can be … |
`maxDepth` | | Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to … |
`minSamplesSplit` | | Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is … |
`minSamplesLeaf` | | Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is … |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`modelInit` | | A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Dictionary used to configure additional settings when saving the model to the registry. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted class labels.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X`, `y`, or `prediction` model parameters. These functions have two different forms depending on whether they are values for the `X` and `y` parameters or for the `prediction` parameter.
Functions for the `X` and `y` model parameters take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `prediction` model parameter take four arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`y` | | Target variable supplied to the model as the …
`predictions` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). This ensures that a minimum of `n` samples are used to train the model. If subsequent data is passed to the stream, the operator outputs predictions for each sample using the model fitted on the first `n` samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a decision tree classifier model on data and store the model in local registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the decisionTreeClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.decisionTreeClassifier[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTCModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry.
.ml.registry.get.modelStore["/tmp";::]
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the decisionTreeClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.decisionTreeClassifier[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Decision Tree Regressor
Applies a Decision Tree Regressor model to the data.
.qsp.ml.decisionTreeRegressor[X;y;prediction]
.qsp.ml.decisionTreeRegressor[X;y;prediction;.qsp.use (!) . flip (
(`criterion ; criterion);
(`splitter ; splitter);
(`maxDepth ; maxDepth);
(`minSamplesSplit; minSamplesSplit);
(`minSamplesLeaf ; minSamplesLeaf);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`y` | | Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. | Required
`prediction` | | Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`criterion` | | Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be … |
`splitter` | | Strategy used to split the nodes in the tree. This can be … |
`minSamplesSplit` | | Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is … |
`minSamplesLeaf` | | Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is … |
`maxDepth` | | Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to … |
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`modelInit` | | A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Dictionary used to configure additional settings when saving the model to the registry. |
Returns:
type | description
---|---
| Returns the input data with an additional column containing the model's predicted target values.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X`, `y`, or `prediction` model parameters. These functions have two different forms depending on whether they are values for the `X` and `y` parameters or for the `prediction` parameter.
Functions for the `X` and `y` model parameters take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `prediction` model parameter take four arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`y` | | Target variable supplied to the model as the …
`predictions` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm only fits the underlying model once the number of records received has exceeded the value of the `bufferSize` parameter (`n`). When training, all data in the batch which causes the buffered data to exceed `n` elements is included in fitting. If subsequent data is passed to the stream, the operator outputs predictions for each sample using the model fitted on the first `n` samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a decision tree regression model on data and store model in local registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the decisionTreeRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.decisionTreeRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] x:5?1f; x1:5?1f; x2:5?1f; y:0n)
| x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the decisionTreeRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.decisionTreeRegressor[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Drop Constant Columns
Drops columns with constant values from the data.
.qsp.ml.dropConstant[X]
.qsp.ml.dropConstant[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Name of the column(s) in the input table to remove because they contain a constant value throughout. Can also be a dictionary mapping column names to their associated constant values whereby only columns with these names and values will be dropped. If set to … | Required
options:
name | type | description | default
---|---|---|---
`bufferSize` | | Number of records to observe before dropping the constant columns from the data. If set to … |
For all common arguments, refer to configuring operators
Returns:
type | description
---|---
| Returns the input data with the constant valued columns no longer in the table.
The columns to be removed from the data are either specified by the user beforehand, through a list or dictionary, or determined using the `.ml.dropConstant` function, which checks the data for columns that contain a constant value throughout. If a non-constant column is supplied, an error is thrown.
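As an illustrative check of the underlying behavior, assuming the ML toolkit's `.ml.dropConstant` is loaded in the process:
// id never varies, so it is dropped; v survives
t:([] id:5#1; v:1 2 3 4 5);
.ml.dropConstant t  / => ([] v:1 2 3 4 5)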
Examples:
The following examples outline the use of the functionality described above.
Example 1: Drop the constant columns `protocol` and `response`.
// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.dropConstant[`protocol`response]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to drop its constant values
publish ([] protocol: `TCP; response: 200i; latency: 10?5f; size: 10?10000);
Example 2: Drop the columns `id` and `ratio`, checking that their values match the expected constant values.
// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.dropConstant[`id`ratio!(1; 2f)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to drop its constant values
publish ([] id: 1; ratio: 2f; data: 10?10f);
Example 3: Drop columns whose value is constant for all buffered records.
// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.dropConstant[::;.qsp.use enlist[`bufferSize]!enlist 100]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to drop its constant values
publish ([] motorID: 0; rpms: 1000 + 200?10; temp: 60 + 200?5)
sp.ml.drop_constant('lastPrice')
Parameters:
name | type | description | default
---|---|---|---
`columns` | | Columns to drop. Either a column name as string to drop a single column, or a list of strings to drop several columns, or a dictionary with column names and expected constant values as the key-value pairs to drop several columns, or … | Required
options:
name | type | description | default
---|---|---|---
`buffer_size` | | The number of data points which must amass before columns are dropped. |
Returns:
A pipeline comprised of a 'drop_constant' operator, which can be joined to other pipelines.
Examples:
Drop a constant column from a batch of data:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
...     | sp.ml.drop_constant('x')
...     | sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
...     'x': np.ones(10),
...     'a': np.random.randn(10),
...     'b': np.random.randn(10)
... })
>>> kx.q('publish', data)
a b
---------------------
1.190654 0.3066241
-0.4255034 1.713646
-1.185454 -0.6019988
-2.161811 0.7125692
-1.934769 0.2542738
2.057811 -0.4257737
0.4890321 0.9682337
-0.1086504 1.506568
-1.311511 -1.68155
0.07270301 0.8865468
Feature Hasher
(Beta Feature) Encodes categorical data across several numeric columns.
Beta Features
To enable beta features, set the environment variable `KXI_SP_BETA_FEATURES` to `true`. See Beta Feature Usage Terms.
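For example, the variable can be set in the shell before startup, or from q before the pipeline is defined (`setenv` shown here as one option):
// enable beta features for this process
setenv[`KXI_SP_BETA_FEATURES; "true"];
getenv `KXI_SP_BETA_FEATURES  / "true"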
.qsp.ml.featureHasher[X;n]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Name of the column(s) in the data to perform the feature hashing on. | Required
`n` | | Number of resulting numeric columns to represent each specified column. | Required
For all common arguments, refer to configuring operators
Returns:
type | description
---|---
| Returns the input table with each specified column now replaced by `n` numeric columns.
This operator is used to encode categorical variables numerically. It is similar to one-hot encoding, but does not require the categories or number of categories to be known in advance.
It converts each chosen column into `n` columns, sending each string/symbol to its truncated hash value. The hash function employed is the signed 32-bit version of MurmurHash3.
As the mapping between values and their hashed representations is effectively random, collisions are possible, and the hash space must be made large enough to reduce collisions to an acceptable level.
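A birthday-bound estimate (a standard approximation, not part of this API) helps size `n`: with `k` distinct categories hashed into `n` columns, the probability of at least one collision is roughly 1 - exp(-k(k-1)/2n):
// approximate probability that any two of k categories collide in n columns
collProb: {[k;n] 1 - exp neg (k*k-1) % 2*n};
collProb[4;10]      / ~0.45: 4 categories in 10 columns collide fairly often
collProb[100;16384] / ~0.26: a larger hash space keeps collisions rarer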
Examples:
The following examples outline the use of the functionality described above.
Example 1: Encode a single categorical column.
// Define and run a stream processor pipeline using the featureHasher operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.featureHasher[`location; 10]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its features through hashing
publish ([] location: 20?`london`paris`berlin`miami; num: til 20);
Example 2: Encode multiple categorical columns.
// Define and run a stream processor pipeline using the featureHasher operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.featureHasher[`srcIP`destIP; 14]
.qsp.write.toVariable[`output];
// Pass a batch of data to the stream processor to encode its features through hashing
IPs: "." sv/: string 4 cut 100?256;
publish ([] srcIP: 100?IPs; destIP: 100?IPs; latency: 100?10; size: 100?10000);
Coming Soon
Fit Registry Model
Fits a model to the data and predicts the target for future batches.
.qsp.ml.registry.fit[X;y;untrained;modelType;prediction]
.qsp.ml.registry.fit[X;y;untrained;modelType;prediction; .qsp.use (!) . flip (
(`bufferSize; bufferSize);
(`modelArgs ; modelArgs);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; config))]
Parameters:
name | type | description | default
---|---|---|---
`X` | | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`y` | | Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. This must be … | Required
`untrained` | | An untrained q/sklearn model that we want to fit. | Required
`modelType` | | Whether the model we are fitting is a … | Required
`prediction` | | Can be the name of the column which is to house the model's predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`bufferSize` | | Number of records to observe before fitting the model. If set to … |
`modelArgs` | | List of arguments passed to the model to help configure the fitting process. |
`model` | | Name of the model to be stored in the registry. If set to … |
`registry` | | Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` | | Name of the experiment in the registry that the fitted model is to be stored under. If set to … |
`config` | | Dictionary used to configure additional settings when saving the model to the registry. |
For all common arguments, refer to configuring operators
Returns:
type | description
---|---
| The current batch, modified in accordance with the `prediction` parameter.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X`, `y`, or `prediction` model parameters. These functions have two different forms depending on whether they are values for the `X` and `y` parameters or for the `prediction` parameter.
Functions for the `X` and `y` model parameters take one argument:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `prediction` model parameter take four arguments:
name | type | description
---|---|---
`data` | | Batch passed to the operator, only the data not the metadata.
`y` | | Target variable supplied to the model as the …
`predictions` | | Model's predictions for each record in the batch.
`modelInfo` | | Information about the model. Currently not used and always set to …
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
Fits a model to a batch or buffer of data, saving the model to the registry, and predicting the target variable for future batches after the model has been trained.
N.B. This is only for models that cannot be trained incrementally. For other models, `.qsp.ml.registry.update` should be used.
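A hedged sketch of the incremental path, assuming `.qsp.ml.registry.update` shares the `fit` argument order (its signature is not shown on this page):
// an online SGD linear regression can be trained incrementally,
// so each batch both updates the model and is scored by it
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.update[`x`x1; `y; .ml.online.sgd.linearRegression; "q"; `yhat]
  .qsp.write.toConsole[];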
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a q model.
// Generate initial data to be used for fitting
a:500?1f;
b:500?1f;
data:([]x:a;x1:b;y:a+b);
// Define optional model fitting parameters
optKeys:`model`registry`experiment`modelArgs;
optVals:("sgdLR";::;::;(1b; `maxIter`gTol`seed!(100;-0w;42)));
opt:optKeys!optVals;
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.fit[
`x`x1;
`y;
.ml.online.sgd.linearRegression;
"q";
`yhat;
.qsp.use opt
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore[::;::]
Example 2: Fit an sklearn model.
// Generate initial data to be used for fitting
data:([]x:asc 100?1f;x1:100?1f;y:desc 100?5);
// Define and fit an sklearn model
rfc:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`max_depth pykw 2];
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.fit[
`x`x1;
`y;
rfc;
"sklearn";
`yhat
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Example 3: Fit an unsupervised q model.
// Generate initial data to be used for fitting
data:([]x:1000?1f;x1:1000?1f;x2:1000?1f);
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.fit[
`x`x1`x2;
::;
.ml.clust.kmeans;
"q";
`cluster;
.qsp.use enlist[`modelArgs]!enlist(`e2dist;3;::)
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Example 4: Fit a model while passing functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the fit model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.fit[
xFunc;
yFunc;
.ml.online.sgd.linearRegression;
"q";
predFunc;
.qsp.use enlist[`modelArgs]!enlist(1b; `maxIter`gTol`seed!(100;-0w;42))
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data
sp.ml.fit(
kx.q('{delete y from x}'),
'y',
kx.q('.ml.online.sgd.logClassifier'),
sp.ml.ModelType.q,
'yhat',
registry='/tmp',
model='qLC',
model_args=[True, {'seed': 42, 'k': 5}]
)
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Column name of the predictor variable, or the function required to generate predictors from a batch. |
Required |
|
|
Column name of the target variable, or the function required to generate the target values from a batch, or |
Required |
|
|
Untrained model. |
Required |
|
|
A |
Required |
|
|
Function used to score the quality of the model or to join predictions into the batch. |
Required |
|
Deprecated, replaced by prediction. |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of records to buffer before training the model, or False to fit on the first batch. When the batch size is exceeded, any additional records will also be included when training. |
|
|
|
List of arguments to pass to the model after X and y. If there is only a single argument, it must be enlisted. |
|
|
|
Name of the current model to be stored within the registry. If 'model' is omitted or |
|
|
|
Local/cloud registry to save the model to. |
|
|
|
Name of the experiment within the registry to save the current model to. |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. See Config Values table below. |
|
Config Values:
name |
type |
description |
default |
---|---|---|---|
|
If provided with 'data', the function will attempt to parse out relevant statistical information associated with the data for use within model deployment. |
||
|
Option to add Python requirements information associated with a model. Either a boolean value |
||
|
Boolean value indicating if model version is to be incremented as a 'major' version i.e. should the model be incremented from |
||
|
The major version to be incremented. By default the function will increment major versions based on the current maximum model version present within the registry. However, 'major_version' allows users to define which major version of a model they wish to be incremented. |
||
|
Reference to the location of any |
||
|
Boolean value indicating the required data format for model training, i.e. should the data have |
||
|
List of metrics to be used for supervised monitoring of the model. |
Returns:
A fit operator, which can be joined to other operators or pipelines.
Examples:
Fit a model on a batch of data and view model in model store:
>>> from kxi import sp
>>> import pykx as kx
>>> import pytest
>>> from sklearn.cluster import KMeans
>>> from sklearn.linear_model import LinearRegression
>>> import pandas as pd
>>> import random
>>> def sp_pipeline(operator):
sp.run(sp.read.from_callback('publish')
| operator
| sp.write.to_variable('.test.cache', mode='overwrite'))
>>> bool_data = pd.DataFrame({
'x': [random.uniform(0, 1) for _ in range(100)],
'x1': [random.uniform(0, 1) for _ in range(100)],
'x2': sorted([random.uniform(0, 1) for _ in range(100)]),
'y': [bool(i) for i in range(1, 101)],
})
>>> sp_pipeline(
sp.ml.fit(kx.q('{delete y from x}'),
'y',
kx.q('.ml.online.sgd.logClassifier'),
sp.ml.ModelType.q,
'yhat',
registry='/tmp',
model='qLC',
model_args=[True, {'seed': 42, 'k': 5}]))
>>> kx.q('publish', bool_data)
>>> assert kx.q('100~count .test.cache')
>>> assert kx.q('{cols[.test.cache]~cols[x],`yhat}', bool_data)
>>> assert kx.q('01b~asc distinct .test.cache`yhat')
>>> kx.q('.test.cache')
x x1 x2 y yhat
-----------------------------------------
0.438085 0.1456259 0.009515322 0 0
0.9236949 0.5339187 0.01069584 0 0
0.1721638 0.01717696 0.02064306 0 0
...
0.7064919 0.9793561 0.5711501 1 0
0.4621936 0.9414561 0.5720107 1 0
0.05334808 0.09241486 0.5788343 1 0
0.3619266 0.7338493 0.5957937 1 0
0.9695426 0.4179357 0.601046 1 0
0.6921203 0.6083793 0.6513348 1 0
0.09326991 0.9893873 0.6885329 1 0
0.1829138 0.7280614 0.7131808 1 0
0.945359 0.1291827 0.7154933 1 0
0.05329599 0.09970421 0.7177233 1 1
0.3780809 0.9164885 0.7378628 1 0
0.7895978 0.1609393 0.7734308 1 0
...
FRESH Feature Creation
Applies FRESH feature creation to the data.
.qsp.ml.freshCreate[X;features]
.qsp.ml.freshCreate[X;features;.qsp.use enlist[`warn]!enlist warn]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Name of the column(s) in the data to use for FRESH feature generation. |
Required |
|
|
Name of the FRESH feature(s) we want to define from the data. A full list of these features can be found here. |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Show warnings 1b / Suppress warnings 0b. |
|
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Returns a table containing the specified aggregated FRESH feature columns for each selected column in the input table. |
Converts each chosen column into a collection of feature values based on the supplied FRESH features. Typically, the operator is intended to be used in conjunction with the windowing operators that provide regular batches of data from which we engineer features. The aggregate statistics used to create these features can be as simple as max/min/count.
For the features parameter, if it is set to:
- :: - all features are applied.
- noHyperparameters - all features except hyperparameters are applied.
- noPython - all features that don't rely on Python are applied.
As this aggregates a batch to a single row of aggregated statistics, the output table does not include the original columns.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Build two features - absEnergy and max.
// Define and run a stream processor pipeline using the freshCreate operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.window.tumbling[00:01:00; `time]
.qsp.ml.freshCreate[`x; `absEnergy`max]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to create the fresh features
publish ([] time: .z.p+00:00:01 * til 500; x: 500?1f);
Example 2: Build all features.
// Define and run a stream processor pipeline using the freshCreate operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.window.count[100]
.qsp.ml.freshCreate[`x; ::]
.qsp.write.toVariable[`output];
// Pass a batch of data to the stream processor to create the fresh features
publish ([] x: 500?1f; y: 500?100);
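Example 3: Build all features that don't rely on Python. A minimal sketch, assuming the noPython feature set described above is supplied as the symbol `noPython.
// Define and run a stream processor pipeline using the freshCreate operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.window.count[100]
.qsp.ml.freshCreate[`x; `noPython]
.qsp.write.toVariable[`output];
// Pass a batch of data to the stream processor to create the fresh features
publish ([] x: 500?1f; y: 500?100);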
sp.ml.fresh_create('x', ['min', 'max', 'absEnergy'])
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Columns on which FRESH feature creation is to be applied, or |
Required |
|
|
Functions to be applied to each column in a batch as either a string naming the function to be applied, or a list of strings naming the functions to be applied, or |
Required |
Returns:
A fresh_create operator, which can be joined to other operators or pipelines.
Examples:
Setting up a stream to compute three features, with a window to batch the data:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.window.timer(np.timedelta64(10, 's'), count_trigger = 1)
| sp.ml.fresh_create('x', ['min', 'max', 'absEnergy'])
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({'x': np.random.randn(10)})
>>> kx.q('publish', data)
x_absEnergy x_max x_min
------------------------------
14.30635 0.917339 -2.102371
Gaussian Naïve Bayes Classifier
Applies a Gaussian Naïve Bayes Classifier model to the data.
.qsp.ml.gaussianNB[X;y;prediction;.qsp.use (!) . flip (
(`priors ; priors);
(`varSmoothing; varSmoothing);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
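The (!) . flip idiom above simply builds the options dictionary from name-value pairs; an equivalent literal form (a minimal sketch with illustrative values):
// These two expressions produce the same options dictionary
opts:(!) . flip ((`bufferSize;100);(`registry;"/tmp"));
opts~`bufferSize`registry!(100;"/tmp")  / 1b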
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is |
|
|
|
Value added to the Gaussian distribution's variance to widen the curve and account for more samples further away from the distribution's mean. Minimum value is |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted class labels. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the values for the X, y, or prediction model parameters.
These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.
Functions for the X and y model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. For example, with bufferSize set to 100 and batches of 60 records, the first batch is buffered and the model is fitted once the second batch brings the total to 120 records. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with a Gaussian Naïve Bayes model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the gaussianNB operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.gaussianNB[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the gaussianNB operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.gaussianNB[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Gradient Boosting Regressor
Applies a Gradient Boosting Regressor model to the data.
.qsp.ml.gradientBoostingRegressor[X;y;prediction]
.qsp.ml.gradientBoostingRegressor[X;y;prediction;.qsp.use (!) . flip (
(`loss ; loss);
(`learningRate ; learningRate);
(`nEstimators ; nEstimators);
(`minSamplesSplit; minSamplesSplit);
(`minSamplesLeaf ; minSamplesLeaf);
(`maxDepth ; maxDepth);
(`modelInit ; modelInit);
(`bufferSize ; bufferSize);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Loss function that is optimized using gradient descent to get the best model fit. Can be |
|
|
|
Controls the loss function used to set the weight of each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is |
|
|
|
Maximum number of tree estimators to train. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is |
|
|
|
Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is |
|
|
|
Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is |
|
|
|
Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted target values. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the values for the X, y, or prediction model parameters.
These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.
Functions for the X and y model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a gradient boosting regression model on data and store model in local registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the gradientBoostingRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.gradientBoostingRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"GbModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]x:5?1f;x1:5?1f;x2:5?1f;y:0n)
| x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the gradientBoostingRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.gradientBoostingRegressor[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
K-Nearest Neighbours Classifier
Applies a K-Nearest Neighbours Classifier model to the data.
.qsp.ml.kNeighborsClassifier[X;y;prediction;.qsp.use (!) . flip (
(`nNeighbors; nNeighbors);
(`weights ; weights);
(`algorithm ; algorithm);
(`leafSize ; leafSize);
(`p ; p);
(`metric ; metric);
(`bufferSize; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of already classified points, which lie closest to a given unclassified point (neighbors), to factor in when predicting the point's class. Minimum value is |
|
|
|
Weight function used to decide how much weight to give to the classes of each of the neighboring points when predicting a point's class. Can be |
|
|
|
Algorithm used to search the vector space and decide which points are the nearest neighbors to a given unclassified point. This algorithm can be a |
|
|
|
If |
|
|
|
Power parameter used when the distance metric |
|
|
|
Distance metric used to measure the distance between points. This value can be |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted class labels. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the values for the X, y, or prediction model parameters.
These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.
Functions for the X and y model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with a k-nearest neighbors classification model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the kNeighborsClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.kNeighborsClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the kNeighborsClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.kNeighborsClassifier[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
K-Nearest Neighbours Regressor
Applies a K-Nearest Neighbours Regressor model to the data.
.qsp.ml.kNeighborsRegressor[X;y;prediction]
.qsp.ml.kNeighborsRegressor[X;y;prediction;.qsp.use (!) . flip (
(`nNeighbors; nNeighbors);
(`weights ; weights);
(`metric ; metric);
(`algorithm ; algorithm);
(`bufferSize; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of points already labeled or predicted, which lie closest to a given unlabeled point (neighbors), to factor in when predicting a value for the point. Minimum value is |
|
|
|
The distance metric to be used for the tree. The default metric is |
|
|
|
Weight function used to decide how much weight to give to each of the neighboring points when predicting the target of a point. Can be |
|
|
|
Algorithm used to search the vector space and decide which points are the nearest neighbors. You can choose to use the algorithms |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted target values. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the values for the X, y, or prediction model parameters.
These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.
Functions for the X and y model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a k-nearest neighbors regression model on data and store model in local registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the kNeighborsRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.kNeighborsRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"k-nearest neighborsModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([]x:5?1f;x1:5?1f;x2:5?1f;y:0n)
| x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the kNeighborsRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.kNeighborsRegressor[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Label Encode Categorical Columns
Encodes symbolic columns as numeric data.
.qsp.ml.labelEncode[X]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Name of the column(s) in the input table whose labels we want to encode. Can also be a dictionary mapping column names to their expected label values whereby the list of values will be used as the first set of encoding values for each column. If set to |
Required |
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Returns the input data with the symbol columns in the data now having been label encoded as numeric values. |
This operator encodes symbolic columns within input data as numeric representations. When data is fed into this operator via a stream, the specified symbol columns are encoded and the mapping of each symbol to its respective encoded number is stored as the state. If new symbols appear in subsequent batches, the state will be updated to reflect this.
Examples:
Example 1: Encode all symbol columns within the data.
// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.labelEncode[::]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its columns
publish ([]10?`a`b`c;10?`d`e`f;10?1f);
Example 2: Encode symbols in column x.
// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.labelEncode[`x]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its columns
publish ([]10?`a`b`c;10?`d`e`f;10?1f);
Example 3: Encode the symbols in the encoded column with the mapping specified.
// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.labelEncode[(enlist `encoded)!enlist `small`medium`large]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to encode its columns
data: 10?`small`medium`large;
publish ([] original: data; encoded: data);
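Example 4: Extend the encoding as new symbols arrive. This sketch illustrates the stateful behavior described above: the mapping learned from the first batch is extended, not rebuilt, when a later batch contains an unseen symbol.
// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.labelEncode[`x]
.qsp.write.toConsole[];
// The first batch establishes the symbol-to-integer mapping
publish ([] x:10?`a`b`c);
// A later batch containing the unseen symbol `d extends the mapping
publish ([] x:10?`a`b`c`d);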
sp.ml.label_encode({'desk': ['equity', 'fixedIncome', 'foreignExchange'], 'broker': ['A', 'B', 'C']})
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
String column name of column to be symbol encoded, or a list of strings containing the names of the columns to be encoded, or a dictionary with column names and their expected string/symbol values as the key-value pairs, or |
Required |
Returns:
A label_encode operator, which can be joined to other operators or pipelines.
Examples:
Label encode two columns with labels given by a dictionary:
>>> from kxi import sp
>>> import pykx as kx
>>> import pandas as pd
>>> import numpy as np
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.label_encode({'x': ['a', 'b', 'c'], 'x1': ['d', 'e', 'f']})
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': ['a','a','b','c','c','a','a','b','c','c'],
'x1': ['e','d','e','f','e','d','e','f','f','f'],
'x2': np.random.randn(10)
})
>>> kx.q('publish', data)
x x1 x2
---------------
0 1 -0.1463336
0 0 -0.4371426
1 1 -1.13658
2 2 0.3031224
2 1 0.3391162
0 0 0.22269
0 1 0.2385185
1 2 -0.296696
2 2 -1.972198
2 2 -0.1701961
Lasso Regressor
Applies a Lasso Regressor model to the data.
.qsp.ml.lasso[X;y;prediction]
.qsp.ml.lasso[X;y;prediction;.qsp.use (!) . flip (
(`alpha ; alpha);
(`fitIntercept; fitIntercept);
(`maxIter ; maxIter);
(`tol ; tol);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Constant that controls the regularization strength by multiplying the L1 regularization term. Minimum value is |
|
|
|
Whether to add a constant value (intercept) to the regression function - |
|
|
|
Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is |
|
|
|
Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted target values. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the values for the X, y, or prediction model parameters.
These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.
Functions for the X and y model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a lasso regression model on data and store model in local registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the lasso operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.lasso[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"LassoModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fit model by passing new data
publish ([] x:5?1f;x1:5?1f;x2:5?1f;y:0n)
| x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the lasso operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.lasso[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Linear Regressor
Applies a Linear Regressor model to the data, using stochastic gradient descent.
.qsp.ml.linearRegression[X;y;prediction]
.qsp.ml.linearRegression[X;y;prediction;.qsp.use (!) . flip (
(`trend ; trend);
(`alpha ; alpha);
(`maxIter ; maxIter);
(`gTol ; gTol);
(`seed ; seed);
(`penalty ; penalty);
(`lambda ; lambda);
(`l1Ratio ; l1Ratio);
(`decay ; decay);
(`p ; p);
(`bufferSize; bufferSize);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Whether to add a constant value (intercept) to the regression function - |
|
|
|
Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value favors newly acquired information over information from previous data. Generally, this value is set to be very small. Minimum value is |
|
|
|
Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is |
|
|
|
Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is |
|
|
|
Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If set to |
|
|
|
Penalty term used to shrink the coefficients of the less contributive variables. Can be |
|
|
|
Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is |
|
|
|
If |
|
|
|
Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is |
|
|
|
Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted target values. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the values for the X, y, or prediction model parameters.
These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.
Functions for the X and y model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. As this is an online model, if subsequent data is passed to the stream, each new collection of data points is used to update the regression model and a prediction will also be made for each record.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with a linear regression model.
// Define and run a stream processor pipeline using the linearRegression operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.linearRegression[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());10000)]
.qsp.write.toVariable[`output];
// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);
// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);
// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish ([] x:asc 100?1f; y:asc 100?1f);
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the linearRegression operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.linearRegression[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
sp.ml.linear_regression('x', 'y', 'yHat', trend=True, buffer_size=10)
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Columns to be extracted as features. Either the individual column name as a string, or the list of column names as strings, or a callable function to extract the relevant features. |
Required |
|
|
Column to be used as label data. String column name, or a callable function. |
Required |
|
|
Function as a string, or callable function used to append predictions to a batch of data. |
Required |
|
Deprecated, replaced by prediction. |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Boolean indicating if the model has a trend coefficient. |
|
|
|
Learning rate applied. |
|
|
|
Maximum possible number of iterations before the run is terminated. This does not guarantee convergence. |
|
|
|
Gradient tolerance, below which the run is terminated. |
|
|
|
Random seed. |
|
|
|
Regularization term as a |
|
|
|
Regularization coefficient. |
|
|
|
Elastic net mixing parameter. This is only used if penalty type |
|
|
|
Decay coefficient. |
|
|
|
Momentum coefficient. |
|
|
|
Integer value which defines the number of data points which must accumulate before linear regression is applied. |
|
Returns:
A pipeline comprised of a linear_regression operator, which can be joined to other pipelines.
Examples:
Fit, predict, and update an online linear regression model on a stream:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.linear_regression('x', 'y', 'yHat', trend=True, buffer_size=10)
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': sorted(np.random.randn(10)),
'y': sorted(np.random.randn(10))
})
>>> kx.q('publish', data)
x y yHat
-----------------------------------
-1.538765 -1.168329 -0.8969647
-0.9830464 -0.9178596 -0.5795978
-0.9159053 -0.2494006 -0.541254
-0.3951384 -0.213047 -0.2438476
-0.05936269 -0.1715577 -0.05208849
0.173539 -0.07597894 0.08092002
0.24296 0.3357135 0.1205659
0.4744972 0.5575508 0.2527951
0.8640336 0.8919448 0.4752567
2.707114 1.210953 1.527827
Logistic Classifier
Applies a Logistic Classifier model to the data, using stochastic gradient descent.
.qsp.ml.logClassifier[X;y;prediction]
.qsp.ml.logClassifier[X;y;prediction; .qsp.use (!) . flip (
(`trend ; trend);
(`alpha ; alpha);
(`maxIter ; maxIter);
(`gTol ; gTol);
(`seed ; seed);
(`penalty ; penalty);
(`lambda ; lambda);
(`l1Ratio ; l1Ratio);
(`decay ; decay);
(`p ; p);
(`bufferSize; bufferSize);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Whether to add a constant value (intercept) to the classification function - |
|
|
|
Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value favors newly acquired information over information from previous data. Generally, this value is set to be very small. Minimum value is |
|
|
|
Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is |
|
|
|
Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is |
|
|
|
Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based off the current timestamp. |
|
|
|
Penalty term used to shrink the coefficients of the less contributive variables. Can be |
|
|
|
Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is |
|
|
|
If |
|
|
|
Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is |
|
|
|
Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted class labels. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the values for the X, y, or prediction model parameters.
These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.
Functions for the X and y model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. As this is an online model, if subsequent data is passed to the stream, each new collection of data points will be used to update the classifier model and a prediction will be made for each record.
Performance Limitations
This functionality is not currently encouraged for use in high-throughput environments. Prediction times for this function are on the order of milliseconds. Further optimizations are expected in later releases.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with a logistic classification model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the logClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.logClassifier[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());1000)]
.qsp.write.toVariable[`output];
// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish data;
// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish data;
// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish data;
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the logClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.logClassifier[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
sp.ml.log_classifier('x', 'y', 'yHat', trend=True, buffer_size=10)
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Columns to be extracted as features. Either the individual column name as a string, or the list of column names as strings, or a callable function to extract the relevant features. |
Required |
|
|
Column to be used as label data. String column name, or a callable function. |
Required |
|
|
A function as a string, or callable function used to append predictions to a batch of data. |
Required |
udf |
Deprecated, replaced by prediction. |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Boolean indicating if the model has a trend coefficient. |
|
|
|
Learning rate applied. |
|
|
|
Maximum possible number of iterations before the run is terminated. This does not guarantee convergence. |
|
|
|
Gradient tolerance, below which the run is terminated. |
|
|
|
Random seed. |
|
|
|
Regularization term as a |
|
|
|
Regularization coefficient. |
|
|
|
Elastic net mixing parameter. This is only used if penalty type |
|
|
|
Decay coefficient. |
|
|
|
Momentum coefficient. |
|
|
|
Integer value which defines the number of data points which must amass before the model is fit. |
|
Returns:
A pipeline comprised of a log_classifier
operator, which can be joined to other pipelines.
Examples:
Fit, predict, and update an online log classifier model on a stream:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.log_classifier('x', 'y', 'yHat', trend=True, buffer_size=10)
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': sorted(np.random.randn(10)),
'y': 5*[False]+5*[True]
})
>>> kx.q('publish', data)
x y yHat
-----------------
-1.094345 0 0
-0.9376853 0 0
-0.7055189 0 0
-0.6558396 0 0
-0.4485486 0 0
-0.3953883 1 0
0.03139856 1 1
0.5305628 1 1
0.8246773 1 1
1.471353 1 1
Min-Max Scale Numerical Columns
Applies min-max scaling to columns in the data.
.qsp.ml.minMaxScaler[X]
.qsp.ml.minMaxScaler[X; .qsp.use (!) . flip (
(`bufferSize; bufferSize);
(`rangeError; rangeError))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Name of the column(s) in the input table whose values we want to scale. Can also be a dictionary mapping column names to the minimum and maximum values to use when scaling. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of records to observe before scaling the numeric columns in the data. If set to |
|
|
|
Whether to raise a range error if new input data falls outside the minimum and maximum data range observed during the initialization of the operator. |
|
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Returns the input data with the numeric columns now being scaled so their values lie between |
This operator scales a set of numeric columns based on a user-supplied data range, or on the minimum and maximum values observed in the data when the operator is applied. The operator is only applied, and the minimum/maximum values decided upon, once the number of data points given to the model exceeds the value of the bufferSize parameter. The operator can also be configured to error if data supplied after the ranges have been set falls outside that range.
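As a point of reference, min-max scaling maps each value v in a column to (v - min) % (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below shows the formula only, under an assumed name (minMax); it is not the operator's implementation.
// Illustrative formula only - not the operator's implementation
minMax:{(x-mn)%max[x]-mn:min x}
minMax 10 25 40 55 70f
// 0 0.25 0.5 0.75 1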
Examples:
The following examples outline the use of the functionality described above.
Example 1: Apply min-max scaling on all data.
// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.minMaxScaler[::]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to min-max scale its columns
publish ([]20?5;20?5;20?10)
Example 2: Apply min-max scaling on the specified columns.
// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.minMaxScaler[`x`x1]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to min-max scale its columns
publish ([]20?5;20?5;20?10)
Example 3: Apply min-max scaling on columns rating
and cost
, with supplied
minimum and maximum values for one column and the other based on a buffer.
// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.minMaxScaler[`rating`cost!(0 10;::); .qsp.use enlist[`bufferSize]!enlist 200]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to min-max scale its columns
publish ([] rating: 3 + 250?5; cost: 250?1000f)
Example 4: Error when passed batches containing data outside the min-max bounds.
// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.minMaxScaler[::;.qsp.use enlist[`rangeError]!enlist 1b]
.qsp.write.toConsole[]
// As no buffer is specified, the min and max values are fit using the initial batch
publish ([]100?5;100?5;100?10)
// As `rangeError` has been set, this batch will cause an error by exceeding the
// expected maximum values
publish 1+([]100?5;100?5;100?10)
sp.ml.min_max_scaler('price')
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Column(s) to be min-max scaled in string format, or a dictionary with column names and |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of data points which must amass before min-max scaling is applied. |
|
|
|
Boolean indicating whether an error should be raised if new data falls outside the min-max range used for scaling. |
|
Returns:
A min_max_scaler
operator, which can be joined to other operators or pipelines.
Examples:
Performs min-max scaling on a column of a batch of data:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.min_max_scaler('x')
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': np.random.randn(10),
'x1': np.random.randn(10)
})
>>> kx.q('publish', data)
x x1
---------------------
0.8952703 -0.03508657
1 1.588994
0.3312846 -0.7038333
0 -0.6815334
0.1397654 -0.8542387
0.6615214 0.3446243
0.200455 -0.3855581
0.3410276 -1.461748
0.9060353 1.485059
0.4513415 1.568283
Multi-Layer Perceptron Classifier
Applies a Multi-Layer Perceptron Classifier model to the data.
.qsp.ml.MLPClassifier[X;y;prediction;.qsp.use (!) . flip (
(`hiddenLayerSizes; hiddenLayerSizes);
(`activation ; activation);
(`solver ; solver);
(`alpha ; alpha);
(`batchSize ; batchSize);
(`learningRate ; learningRate);
(`learningRateInit; learningRateInit);
(`powerT ; powerT);
(`maxIter ; maxIter);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the column which is to house the model's predicted label values for each data record OR a function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
List of the number of neurons in each hidden layer in the neural network. Minimum size of each layer is |
|
|
|
Activation function applied to the output of each hidden layer. This value can be |
|
|
|
Optimization function used to search for the inputs that minimize/maximize the results of the model function. This value can be |
|
|
|
Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss function and is used to reduce the chance of model overfitting. Minimum value is |
|
|
|
Number of training examples used in each stochastic optimization iteration. Minimum value is |
|
|
|
Learning rate schedule for updating the weights of the neural network. Only used when the optimization function is set to |
|
|
|
Starting learning rate value. This controls the step-size used when updating the neural network weights. Not used when the optimization function is set to |
|
|
|
Exponent used to update the learning rate when the learning rate is set to |
|
|
|
Maximum number of optimization epochs/iterations. The model will iterate until it converges or until it completes this number of iterations. Minimum value is |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted class labels. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the X
, y
, or prediction
model parameters.
These functions have two different forms depending on whether they are values for the X
and y
parameters or for the prediction
parameter.
Functions for the X
and y
model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction
model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select
, exec
, update
, and delete
statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize
parameter (n
).
This ensures that a minimum of n
samples are used to train the model.
If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n
samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with a multi-layer perceptron classifier model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the MLPClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.MLPClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the MLPClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.MLPClassifier[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
One-Hot Encode Categorical Columns
Encodes symbolic/string values to a binary representation per value.
.qsp.ml.oneHot[x]
.qsp.ml.oneHot[x; .qsp.use enlist[`bufferSize]!enlist bufferSize]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Name of the column(s) in the input table to one-hot encode. Can also be a dictionary mapping column names to their expected values whereby only columns with these names and values will be encoded. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of records to observe before one-hot encoding the symbol columns in the data. If set to |
|
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Returns the input data with the symbol columns in the data now each being represented by multiple numeric columns populated by |
Encodes symbolic and string data as numeric representations.
When data is fed into the operator via a stream, the algorithm will only be applied to the data
when the number of records received has exceeded the value of the bufferSize
parameter.
When this happens, the buffered data is one-hot encoded.
If subsequent data is passed which contains symbols that were not present at the time of the
original fitting, these symbols will be mapped to 0
.
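For intuition, the encoding itself can be sketched in plain q as below. The names cats and vals are assumptions for the example; this is not the operator's implementation, but it shows how a value absent from the fitted categories encodes as all zeros.
// Illustrative only - not the operator's implementation
cats:`a`b`c;                            // categories seen during fitting
vals:`a`c`b`d;                          // `d was not seen during fitting
flip(`$"x_",/:string cats)!vals=/:cats
// x_a x_b x_c
// -----------
// 1   0   0
// 0   0   1
// 0   1   0
// 0   0   0                            // unseen `d maps to all zeros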
Examples:
Example 1: Encode all the symbolic or string columns.
// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.oneHot[::]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] action: 10?`upload`download; fileType: 10?("image";"audio";"document"); size: 10?100000)
Example 2: Encode column x
.
// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.oneHot[`x]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] x:10?`a`b`c; y:10?1f)
Example 3: Encode columns x
and x1
with a required buffer.
// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.oneHot[`x`x1;.qsp.use ``bufferSize!(`;200)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] 250?`a`b`c; 250?`d`e`f`j; 250?0b)
Example 4: Encode the columns axis
and status
using given values. This is useful when the
categories are known in advance, but may not be present in the training data.
// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.oneHot[`axis`status!(`x`y`z; `normal`error)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)
Example 5: Encode columns axis
and status
using a hybrid method, where one column's values are derived from the data and the other's are given explicitly.
// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.oneHot[`axis`status!(::; `normal`error)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)
sp.ml.one_hot_encode(['trader','broker'])
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Column(s) to be one-hot encoded in string format, or a dictionary with column names and expected symbol/string values as the key-value pairs to be encoded, or |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
The number of data points which must amass before symbolic data is encoded. |
|
Returns:
A one_hot_encode
operator, which can be joined to other operators or pipelines.
Examples:
Performs one-hot encoding on a batch of data:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.one_hot_encode()
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': ['a','a','b','c','c','a','b','c','a','b'],
'x1': ['a','b','b','c','b','a','b','c','b','b'],
'x2': np.random.randn(10),
'x3': 5*["abc","def"]
})
>>> kx.q('publish', data)
x2 x_a x_b x_c x1_a x1_b x1_c x3_abc x3_def
---------------------------------------------------
0.9442428 1 0 0 1 0 0 1 0
-0.2004368 1 0 0 0 1 0 0 1
-0.3597777 0 1 0 0 1 0 1 0
-1.408116 0 0 1 0 0 1 0 1
-0.2833716 0 0 1 0 1 0 1 0
1.072297 1 0 0 1 0 0 0 1
0.01474962 0 1 0 0 1 0 1 0
0.4720782 0 0 1 0 0 1 0 1
-1.032484 1 0 0 0 1 0 1 0
-0.3735379 0 1 0 0 1 0 0 1
Predict Using Registry Model
Predicts a target variable using a model from the registry.
.qsp.ml.registry.predict[X;prediction];
.qsp.ml.registry.predict[X;prediction; .qsp.use (!) . flip (
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`version ; version))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column which is to house the model's predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Name of the fitted model we want to load from the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be loaded from. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model we want to load is stored under. If set to |
|
|
|
Version of the fitted model we want to load from the registry. If set to |
|
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted label values for each data point. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the X
, y
, or prediction
model parameters.
These functions have two different forms depending on whether they are values for the X
and y
parameters or for the prediction
parameter.
Functions for the X
and y
model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction
model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select
, exec
, update
, and delete
statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
.qsp.ml.registry.predict
will predict the target value for each record in the batch,
using a model from the registry.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Get predictions from an sklearn model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:desc n?10;x2:n?1f;y:asc n?5);
// Define and fit an sklearn model
features:flip value flip delete y from data;
clf1:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf1:clf1[`max_depth pykw 3];
clf1[`:fit][features;data`y];
// Set the model within the existing registry
.ml.registry.set.model[::;::;clf1;"skModel";"sklearn";::];
// Define and run a stream processor pipeline using the predict model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.predict[
`x`x1`x2;
`yhat;
.qsp.use enlist[`model]!enlist["skModel"]
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to get predictions from the model
publish data;
Example 2: Get predictions from a q model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and fit a q model
features:data`x`x1`x2;
kmeansModel:.ml.clust.kmeans.fit[features;`e2dist;6;enlist[`iter]!enlist[1000]];
// Set the model within the existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::];
// Define and run a stream processor pipeline using the predict model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.predict[
`x`x1`x2;
`yhat;
.qsp.use enlist[`model]!enlist["kmeansModel"]
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to get predictions from the model
publish data;
Example 3: Get predictions from a q model by passing functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions to be passed to the model arguments
xFunc: {[data]
select x, x1 from data
};
clustFunc: {[data;clusters;modelInfo]
update newClust: clusters from data
};
// Define and run a stream processor pipeline using the predict model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.predict[
xFunc;
clustFunc;
.qsp.use enlist[`model]!enlist["kmeansModel"]
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data
sp.ml.predict(
['x','x1','x2'],
'yhat',
registry='/tmp',
model='skLC'
)
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Column name of the predictor variable, or the function required to generate predictors from a batch. |
Required |
|
|
Function for integrating the predictions into the batch, or a column name to join predictions to the table. |
Required |
|
Deprecated, replaced by prediction. |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Model within the registry to use for prediction. |
|
|
|
Registry to load models from. |
|
|
|
Experiment within the registry to load models from. |
|
|
|
Model version to load in for prediction. |
|
Returns:
A predict
operator, which can be joined to other operators or pipelines.
Examples:
Save a model to registry then retrieve it to serve predictions:
>>> from kxi import sp, ml
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> import sklearn.linear_model as sk
>>> data = pd.DataFrame({
'x': np.random.randn(500),
'x1': sorted(np.random.randn(500)),
'x2': np.random.randn(500),
'y': 250*[False]+250*[True]
})
>>> X=data[['x','x1','x2']].to_numpy()
>>> y=data['y'].to_numpy()
>>> model=sk.LogisticRegression().fit(X,y)
>>> ml.registry.set.model(model,"skLC",'sklearn',folder_path="/tmp")
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.predict(['x','x1','x2'],
'yhat',
registry='/tmp',
model='skLC')
| sp.write.to_console(timestamp='none'))
>>> kx.q('publish', data[['x','x1','x2']].iloc[:10])
x x1 x2 yhat
--------------------------------------
0.346299 -2.515892 0.2226648 0
0.4090534 -2.313049 -0.08848148 0
-1.102503 -2.303455 -0.4292562 0
0.05339919 -2.260012 0.02841283 0
-0.3747164 -2.257401 0.7677529 0
-0.08124184 -2.200042 -1.418281 0
1.393183 -2.198983 0.954484 0
0.3279165 -2.17734 0.08885551 0
0.1826349 -2.171023 -0.7898705 0
-2.060252 -2.113853 1.085127 0
Quadratic Discriminant Analysis Classifier
Applies a Quadratic Discriminant Analysis Classifier model to the data.
.qsp.ml.quadraticDiscriminantAnalysis[X;y;prediction;.qsp.use (!) . flip (
(`priors ; priors);
(`bufferSize; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's class labels OR a user-defined function of the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted class labels. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the X
, y
, or prediction
model parameters.
These functions have two different forms depending on whether they are values for the X
and y
parameters or for the prediction
parameter.
Functions for the X
and y
model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction
model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select
, exec
, update
, and delete
statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize
parameter (n
).
This ensures that a minimum of n
samples are used to train the model.
If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n
samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with a quadratic discriminant analysis model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the quadraticDiscriminantAnalysis operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.quadraticDiscriminantAnalysis[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the quadraticDiscriminantAnalysis operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.quadraticDiscriminantAnalysis[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Random Forest Classifier
Applies a Random Forest Classifier model to the data.
.qsp.ml.randomForestClassifier[X;y;prediction;.qsp.use (!) . flip (
(`nEstimators ; nEstimators);
(`criterion ; criterion);
(`maxDepth ; maxDepth);
(`minSamplesSplit ; minSamplesSplit);
(`minSamplesLeaf ; minSamplesLeaf);
(`minWeightFractionLeaf; minWeightFractionLeaf);
(`maxFeatures ; maxFeatures);
(`maxLeafNodes ; maxLeafNodes);
(`minImpurityDecrease ; minImpurityDecrease);
(`bootstrap ; bootstrap);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of decision tree estimators to train and use. Each estimator is fit independently (on a bootstrap sample of the dataset when bootstrapping is enabled), and their predictions are aggregated to produce the final classification. Minimum value is |
|
|
|
Criterion used to measure the quality of a split each time a decision tree node is split into children. This can be |
|
|
|
Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to |
|
|
|
Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is |
|
|
|
Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is |
|
|
|
Minimum proportion of sample weight required to be at any leaf node relative to the total weight of all samples in the tree. When the |
|
|
|
Maximum number of features to consider when looking for the best way to split a node. This value can be |
|
|
|
Maximum number of leaf nodes in each decision tree. This forces the tree to grow in a best-first fashion with the best nodes based on their relative reduction in impurity. If set to |
|
|
|
Minimum impurity decrease value required to split a node. If the tree impurity would not decrease by more than this value, the node will not be split. Minimum value is |
|
|
|
Whether bootstrap samples are used when building trees. If |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted class labels. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the X
, y
, or prediction
model parameters.
These functions have two different forms depending on whether they are values for the X
and y
parameters or for the prediction
parameter.
Functions for the X
and y
model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction
model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select
, exec
, update
, and delete
statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize
parameter (n
).
This ensures that a minimum of n
samples are used to train the model.
If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n
samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with a random forest classification model.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);
// Define and run a stream processor pipeline using the randomForestClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.randomForestClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the randomForestClassifier operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.randomForestClassifier[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Random Forest Regressor
Applies a Random Forest Regressor model to the data.
.qsp.ml.randomForestRegressor[X;y;prediction]
.qsp.ml.randomForestRegressor[X;y;prediction;.qsp.use (!) . flip (
(`nEstimators ; nEstimators);
(`criterion ; criterion);
(`minSamplesSplit; minSamplesSplit);
(`minSamplesLeaf ; minSamplesLeaf);
(`maxDepth ; maxDepth);
(`bufferSize ; bufferSize);
(`modelInit ; modelInit);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`config ; registryConfig))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Number of decision tree estimators to train and use. Each estimator is fit independently (on a bootstrap sample of the dataset when bootstrapping is enabled), and their predictions are averaged to produce the final prediction. Minimum value is |
|
|
|
Criterion used to measure the quality of a split each time a decision tree node is split into children. This can be |
|
|
|
Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is |
|
|
|
Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is |
|
|
|
Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Dictionary used to configure additional settings when saving the model to the registry. |
|
Returns:
type |
description |
---|---|
|
Returns the input data with an additional column containing the model's predicted target values. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the X
, y
, or prediction
model parameters.
These functions have two different forms depending on whether they are values for the X
and y
parameters or for the prediction
parameter.
Functions for the X
and y
model parameters take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the prediction
model parameter take four arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Target variable supplied to the model as the |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
select
, exec
, update
, and delete
statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize
parameter (n
).
When training, all data in the batch which causes the buffered data to exceed n
elements is included in fitting.
If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n
samples.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a random forest regression model on data and store the model in a local registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);
// Define and run a stream processor pipeline using the randomForestRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.randomForestRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"RafrModel")]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
// We can retrieve predictions using this fitted model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
| x x1 x2 y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248 0.591584 0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808 0.3408518 0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502 0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611 0.2372084 0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186 0.1082126 0.9550528
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the randomForestRegressor operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.randomForestRegressor[xFunc;yFunc;predFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
Coming Soon
Score Model
Score the performance of a model.
.qsp.ml.score[y;predictions;metric]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. |
Required |
|
|
Can be the name of the column which houses the model's predictions OR a user-defined function that will generate predictions from the input data. |
Required |
|
|
Metric to use to compare the predictions with the |
Required |
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
The evaluation score given by the metric. |
Score the performance of a model over time, allowing changes in model performance to be evaluated. The values returned are the cumulative scores, rather than scores for the individual batches.
The following metrics are currently supported:
- f1
- accuracy
- mse
- rmse
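For reference, the sketch below gives illustrative q definitions of these metrics, with x as the true values and y as the predictions (boolean vectors for f1 and accuracy). These are not the operator's internal code, which additionally accumulates the scores across all batches seen.
// Illustrative definitions only - not the operator's implementation
accuracy:{avg x=y}                  // proportion of exact matches
mse:{avg d*d:x-y}                   // mean squared error
rmse:{sqrt avg d*d:x-y}             // root mean squared error
f1:{tp:sum x&y; 2*tp%sum[x]+sum y}  // binary F1 from boolean vectors
f1[1 1 0 1b;1 0 0 1b]               // 0.8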
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit a scikit-learn model, predict y
, and calculate
the cumulative F1 score of the model on receipt of new data.
// Retrieve a predefined dataset and format it appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;
// Split the data into a training and testing set
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;
// Train the model
features:flip value flip delete y from training;
targets :training`y;
clf:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf:clf[`max_depth pykw 3];
clf[`:fit][features;targets];
// Set model within the existing registry
.ml.registry.set.model[::;::;clf;"skModel";"sklearn";::];
// Define and run a stream processor pipeline using the score operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.predict[
{delete y from x};
`pred;
.qsp.use enlist[`model]!enlist"skModel"]
.qsp.ml.score[`y; `pred; `f1]
.qsp.write.toConsole[];
// Pass the test data to the stream processor to evaluate the predictive performance of the model
publish testing;
Example 2: Fit a q model, predict y
, and score the
cumulative accuracy on receipt of new data.
// Retrieve a predefined dataset and format it appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;
// Split the data into training and testing sets
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;
// Train the model
features:flip value flip delete y from training;
targets:training`y;
model:.ml.online.sgd.logClassifier.fit[features;targets;1b;::];
// Set model within the existing registry
.ml.registry.set.model[::;::;model;"myModel";"q";::]
// Define and run a stream processor pipeline using the score operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.predict[
{delete y from x};
`pred;
.qsp.use enlist[`model]!enlist"myModel"]
.qsp.ml.score[`y; `pred; `accuracy]
.qsp.write.toConsole[]
// Pass the test data to the stream processor to evaluate the predictive performance of the model
publish testing
sp.ml.score('y', 'yhat', 'mse')
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Column containing the true values as a string, or a callable function to retrieve the true values from the batch. |
Required |
|
|
Column containing the predicted values as a string, or a callable function to retrieve the predicted values from the batch. |
Required |
|
|
A |
Required |
Returns:
A score
operator, which can be joined to other operators or pipelines.
Examples:
Computes the mean squared error between true and predicted results:
>>> from kxi import sp, ml
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> import sklearn.linear_model as sk
>>> data = pd.DataFrame({
'x' :np.random.randn(500),
'x1':sorted(np.random.randn(500)),
'x2':np.random.randn(500),
'y' :250*[False] + 250*[True]
})
>>> training = data.iloc[:450].reset_index(drop=True)
>>> testing = data.iloc[450:].reset_index(drop=True)
>>> X = training[['x','x1','x2']].to_numpy()
>>> y = training['y'].to_numpy()
>>> model = sk.LogisticRegression().fit(X,y)
>>> ml.registry.set.model(model,"skLC",'sklearn',folder_path="/tmp")
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.predict(['x', 'x1', 'x2'], 'yhat', registry='/tmp', model='skLC')
| sp.ml.score('y', 'yhat', 'mse')
| sp.write.to_console(timestamp='none'))
>>> kx.q('publish', testing)
0f
Sequential K-Means Clustering
Applies a Sequential K-Means Clustering algorithm to the data.
.qsp.ml.sequentialKMeans[X;cluster]
.qsp.ml.sequentialKMeans[X;cluster; .qsp.use (!) . flip (
(`df ; df);
(`k ; k);
(`centers ; centers);
(`init ; init);
(`alpha ; alpha);
(`forgetful ; forgetful);
(`bufferSize; bufferSize);
(`model ; model);
(`registry ; registry);
(`experiment; experiment);
(`config ; config))]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to |
Required |
|
|
Name of the column which is to house the model's predicted cluster labels. If set to |
Required |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Distance function used to measure the distance between points when clustering. This can be |
|
|
|
Final number of clusters to be defined by the model. Minimum value is |
|
|
|
A dictionary mapping each cluster to the centroid value with which it should be initialized. |
|
|
|
Initialization method for the cluster centroids. This value can either be K-means++ ( |
|
|
|
Controls the rate at which the concept of forgetfulness is applied within the algorithm. If forgetful Sequential K-Means is applied, this value defines how much past cluster centroid information is retained; if not, this is set to |
|
|
|
Whether to apply forgetful Sequential K-Means ( |
|
|
|
Number of records to observe before fitting the model. If set to |
|
|
|
Name of the model to be stored in the registry. If set to |
|
|
|
Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to |
|
|
|
Name of the experiment in the registry that the fitted model is to be stored under. If set to |
|
|
|
Configuration used for fitting the model. |
|
For all common arguments, refer to configuring operators
Returns:
type |
description |
---|---|
|
Null during initial fitting. Afterwards returns the input data with an additional column containing the model's predicted cluster labels. |
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the X
or cluster
model parameters.
These functions have two different forms depending on whether they are values for the X
parameter or for the cluster
parameter.
Functions for the X
model parameter take one argument:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the cluster
model parameter take three arguments:
name |
type |
description |
---|---|---|
|
|
Batch passed to the operator, only the data not the metadata. |
|
|
Model's predictions for each record in the batch. |
|
|
Information about the model. Currently not used and always set to |
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;clusters;modelInfo]
...
}
select
, exec
, update
, and delete
statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
The sequential K-Means algorithm is applied within a streaming framework.
When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize
parameter (n
).
When training, all data in the batch which causes the buffered data to exceed n
elements is included in fitting.
As this is an online model, if subsequent data is passed to the stream, each new collection of data points is used to update the current cluster centers and predictions are made as to which cluster each point belongs.
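For intuition, the forgetful update moves each centroid a fixed fraction of the way toward every newly assigned point, so older observations decay geometrically. The sketch below is illustrative only, under an assumed name (updateCentroid); it is not the operator's implementation. The alpha argument plays the role of the learning-rate option described above.
// Illustrative only - not the operator's implementation
// Move centroid c a fraction alpha of the way toward the new point p
updateCentroid:{[alpha;c;p] c+alpha*p-c}
updateCentroid[0.1;0 0f;1 1f]
// 0.1 0.1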
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit, update, and predict with the sequential K-Means model.
// Define and run a stream processor pipeline using the sequentialKMeans operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.sequentialKMeans[`x`x1`x2; `cluster; .qsp.use enlist[`bufferSize]!enlist 100]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish ([]100?1f;100?1f;100?1f);
// Now that the bufferSize has been reached, we can retrieve predictions using this fitted model by passing new data
publish ([] 50?1f; 50?1f; 50?1f);
Example 2: Pass functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions that will be passed as the model arguments
xFunc: {[data]
select x, x1 from data
};
clustFunc: {[data;clusters;modelInfo]
update newClust: clusters from data
};
// Define and run a stream processor pipeline using the sequentialKMeans operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.sequentialKMeans[xFunc;clustFunc]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data;
sp.ml.sequential_k_means(['x', 'x1', 'x2'], 'myClusters', buffer_size=10)
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Columns to extract from data for clustering. |
Required |
|
|
Column name for integrating the predictions into the batch. |
Required |
|
Deprecated, replaced by cluster. |
options:
name |
type |
description |
default |
---|---|---|---|
|
|
Similarity measure used for clustering points; a |
|
|
|
Number of cluster centers. |
|
|
|
Centers used for initialization of the algorithm. |
|
|
|
Boolean indicating how to initialize cluster centroids - K-means++ ( |
|
|
|
Learning rate value between 0 and 1 used to define how much past centroid information to retain. |
|
|
|
Apply forgetful ( |
|
|
|
Integer value defining the number of data points which must amass before initial clustering is applied. |
|
Returns:
A pipeline comprised of a sequential_k_means
operator, which can be joined to other pipelines.
Examples:
Discover clusters in a batch of data:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.sequential_k_means(['x', 'x1', 'x2'], 'myClusters', buffer_size=10)
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': np.random.randn(10),
'x1': np.random.randn(10),
'x2': np.random.randn(10)
})
>>> kx.q('publish', data)
x x1 x2 myClusters
---------------------------------------------
-0.03652799 -0.120031 -2.991009 1
-1.290761 0.1046565 0.007084028 0
1.005217 -0.8182222 -0.9399138 0
0.4791692 0.1235131 -0.6635202 0
-0.1670914 -0.2006089 0.5007579 0
-0.3650267 -1.105404 0.988489 2
-0.2914796 0.3655497 2.741289 2
-0.5778521 -1.107383 -1.215941 0
1.126198 -0.3380897 1.903199 2
-0.2179497 0.6098755 0.2427696 0
Standardize Numerical Columns
Apply standardization to the data.
.qsp.ml.standardize[X]
.qsp.ml.standardize[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]
Parameters:
name |
type |
description |
default |
---|---|---|---|
|
|
Name of the column(s) in the input table to standardize. If set to |
Required |
options:
name | type | description | default
---|---|---|---
`bufferSize` |  | Number of records to observe before standardizing the numerical columns in the data. If set to … |
For all common arguments, refer to configuring operators
Returns:
type | description
---|---
`table` | Returns the input data with the numeric columns now having a mean value of `0` and a standard deviation of `1`.
Standardize a user-specified set of columns in an input table. When data is fed into this operator via a stream, the algorithm only scales the data once the number of records received has exceeded the value of the `bufferSize` parameter. At that point, the mean and standard deviation of each column are computed. These statistics are then applied to subsequent batches, which are normalized by subtracting the mean and dividing the result by the standard deviation.
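Concretely, each standardized value is a z-score: the column mean is subtracted and the result is divided by the column's standard deviation. Below is a minimal sketch of the per-column calculation; the `zscore` helper is illustrative only, not part of the API.
// Illustrative sketch only: scale a numeric vector to mean 0, standard deviation 1
zscore: {[x] (x - avg x) % dev x};
zscore 2 4 6 8f    // returns -1.341641 -0.4472136 0.4472136 1.341641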
Examples:
The following examples outline the use of the functionality described above.
Example 1: Apply standardization to all data.
// Define and run a stream processor pipeline using the standardize operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.standardize[::]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to standardize its columns
publish ([]100?5;100?5;100?10)
Example 2: Apply standardization to specified columns.
// Define and run a stream processor pipeline using the standardize operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.standardize[`x`x1]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to standardize its columns
publish ([]100?5;100?5;100?10)
Example 3: Apply standardization to all columns using a buffer.
// Define and run a stream processor pipeline using the standardize operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.standardize[::; .qsp.use enlist[`bufferSize]!enlist 200]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to standardize its columns
publish ([] length: 100 + 250?2f; width: 10 + 250?1f);
sp.ml.standardize()
Parameters:
name | type | description | default
---|---|---|---
`X` |  | Either column(s) to be standard scaled in string format, or … | Required
options:
name | type | description | default
---|---|---|---
`buffer_size` |  | The number of data points which must amass before standard scaling is applied. |
Returns:
A `standardize` operator, which can be joined to other operators or pipelines.
Examples:
Apply standard scaling to a batch of data:
>>> from kxi import sp
>>> import pykx as kx
>>> import numpy as np
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
| sp.ml.standardize()
| sp.write.to_console(timestamp='none'))
>>> data = pd.DataFrame({
'x': 10*np.random.randn(10),
'x1': 100*np.random.randn(10),
'x2': 20*np.random.randn(10)
})
>>> kx.q('publish', data)
x x1 x2
----------------------------------
1.480517 -0.6734071 -0.9577657
0.2441469 0.2270268 1.673357
-0.06926182 -1.072527 0.03043792
0.8506082 -0.6297122 -0.8603573
0.9383493 2.502053 0.2302604
0.2230779 -0.1797943 -0.04656411
-1.520206 0.6135481 -1.625691
0.3302589 -1.047889 -0.6386158
-0.6906624 0.401498 0.982027
-1.786828 -0.1407956 1.212912
Update Registry Model
Trains a model incrementally and returns predictions for each record in a batch.
.qsp.ml.registry.update[X;y;prediction]
.qsp.ml.registry.update[X;y;prediction; .qsp.use (!) . flip (
(`untrained ; untrained);
(`bufferSize ; bufferSize);
(`modelArgs ; modelArgs);
(`model ; model);
(`registry ; registry);
(`experiment ; experiment);
(`version ; version);
(`config ; config))]
Parameters:
name | type | description | default
---|---|---|---
`X` |  | Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to … | Required
`y` |  | Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. | Required
`prediction` |  | Can be the name of the column which is to house the model's predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to … | Required
options:
name | type | description | default
---|---|---|---
`untrained` |  | An untrained q model that we want to update. If set to … |
`bufferSize` |  | Number of records to observe before updating the model. If set to … |
`modelArgs` |  | List of arguments passed to the model to help configure the updating process. |
`model` |  | Name of the model to be loaded from/stored in the registry. If no value is supplied, the model will not be loaded from/stored in the registry. |
`registry` |  | Location of the registry where the model to be loaded is stored/the updated model is to be stored. This can be a local path or a cloud storage path. If set to … |
`experiment` |  | Name of the experiment in the registry that the model is to be loaded from/the updated model is to be stored under. If no value is supplied, the model will be loaded from/stored under … |
`version` |  | Version of the model we want to load from the registry. If set to … |
`config` |  | Dictionary used to configure additional settings when saving the model to the registry. |
For all common arguments, refer to configuring operators
Returns:
type | description
---|---
`table` | The current batch, modified in accordance with the `prediction` parameter.
Note
Passing functions as the values for the model parameters
Functions can be passed as the value for the `X`, `y`, or `prediction` model parameters. These functions have two different forms depending on whether they are values for the `X` and `y` parameters or for the `prediction` parameter.
Functions for the `X` and `y` model parameters take one argument:
name | type | description
---|---|---
`data` |  | Batch passed to the operator, only the data not the metadata.
This function is used to extract lists of values from the input data and takes the following form:
func: {[data]
...
}
Functions for the `prediction` model parameter take four arguments:
name | type | description
---|---|---
`data` |  | Batch passed to the operator, only the data not the metadata.
`y` |  | Target variable supplied to the model as the `y` parameter.
`predictions` |  | Model's predictions for each record in the batch.
`modelInfo` |  | Information about the model. Currently not used and always set to `::`.
This function is used to add a set of aggregate predictions to the output table and takes the following form:
func: {[data;y;predictions;modelInfo]
...
}
`select`, `exec`, `update`, and `delete` statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
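For instance, an extractor for the `y` parameter could use `exec` to return the target values as a plain list rather than a one-column table. This is a hypothetical sketch, assuming the batch contains a `y` column:
// Illustrative sketch only: extract the target values as a list using exec
yFunc: {[data]
    exec y from data
    };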
Train a model incrementally, returning predictions for each record in a batch.
Python support
Currently this functionality is only supported for q models. Support for deployment of online learning models written in Python is scheduled for a later release.
Examples:
The following examples outline the use of the functionality described above.
Example 1: Fit an untrained q model which can be updated.
// Generate initial data to be used for fitting
a:500?1f;
b:500?1f;
data:([]x:a;x1:b;y:a+b);
// Define and run a stream processor pipeline using the update model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.update[
`x`x1;
`y;
`yhat;
.qsp.use enlist[`untrained]!enlist[.ml.online.sgd.linearRegression]
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to update the model
publish data;
Example 2: Fit an untrained model by passing functions as the model arguments.
// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);
// Define the functions to be passed to the model arguments
xFunc: {[data]
select x, x1 from data
};
yFunc: {[data]
delete x,x1,x2 from data // this is the same as 'select y from data' as data only has 4 columns
};
predFunc: {[data;y;predictions;modelInfo]
update newPred: predictions from data
};
// Define and run a stream processor pipeline using the update model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.update[
xFunc;
yFunc;
predFunc;
.qsp.use enlist[`untrained]!enlist[.ml.online.sgd.linearRegression]
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to fit the model
publish data
Example 3: Update a q model from the registry.
// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);
// Generate data to be used for updating
n:100000;
data2:([]x:n?1f;x1:n?1f;x2:n?1f);
// Define and fit a q model
features:data`x`x1`x2;
kmeansModel:.ml.clust.kmeans.fit[features;`e2dist;6;enlist[`iter]!enlist[1000]];
// Set the model within the existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::];
// Define optional model fitting parameters
optKeys:`model`registry`experiment;
optVals:("kmeansModel";::;::);
opt:optKeys!optVals;
// Define and run a stream processor pipeline using the update model operator
.qsp.run
.qsp.read.fromCallback[`publish]
.qsp.ml.registry.update[
`x`x1`x2;
`y;
`yhat;
.qsp.use opt
]
.qsp.write.toConsole[];
// Pass a batch of data to the stream processor to update the model
publish data2;
// Call the get model store function to show the original and updated models have been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
Coming Soon