Stats

This page details the statistical functions available in the Stream Processor.

Describe

Computes the requested descriptive statistics on the provided columns.

.qsp.stats.describe[fields; stats]

Parameters:

| name | type | description | default |
|---|---|---|---|
| fields | symbol or symbol[] | A list of column names to compute statistics on | Required |
| stats | symbol, symbol[], or list of tuples and symbols | A list of statistics which should be computed | Required |

Statistic Options

| name | type | description |
|---|---|---|
| minimum | symbol | Computes the minimum of each provided column |
| maximum | symbol | Computes the maximum of each provided column |
| range | symbol | Computes the range of each provided column |
| length | symbol | Counts the length of the batch provided |
| total | symbol | Computes the total sum of each provided column |
| average | symbol | Computes the average of each provided column |
| numDistinct | symbol | Counts the number of distinct elements in each provided column |
| numNull | symbol | Counts the number of null elements in each provided column |
| numInfinity | symbol | Counts the number of infinite elements in each provided column |
| median | symbol | Computes the median of each provided column |
| quartiles | symbol | Computes the quartiles of each provided column |
| frequency | symbol | Creates a frequency dictionary for each provided column |
| mode | symbol | Computes all modes of each provided column |
| sampleVar | symbol | Computes the sample variance of each provided column |
| sampleStd | symbol | Computes the sample standard deviation of each provided column |
| populationVar | symbol | Computes the population variance of each provided column |
| populationStd | symbol | Computes the population standard deviation of each provided column |
| standardError | symbol | Computes the standard error of each provided column |
| skew | symbol | Computes the Fisher-Pearson coefficient of skewness of each provided column |
| percentiles | tuple | Computes the specified percentiles on each provided column |

Note: some statistics do not support categorical data and return a generic null for such columns.

For all common arguments, refer to configuring operators
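The sample and population statistics in the options table differ only in the divisor: the sample variants divide the sum of squared deviations by n - 1, the population variants by n. The sketch below illustrates that difference with Python's standard statistics module. It is not the operator's implementation, and the data is simply the y column used in this page's describe example.

```python
import statistics

# The y column used in this page's describe example.
y = [10, 13, 1, 9, 8]

# Sample statistics divide the sum of squared deviations by n - 1.
sample_var = statistics.variance(y)
sample_std = statistics.stdev(y)

# Population statistics divide the same sum by n.
population_var = statistics.pvariance(y)
population_std = statistics.pstdev(y)

print("sampleVar:", sample_var, "sampleStd:", sample_std)
print("populationVar:", population_var, "populationStd:", population_std)
```

For any batch with more than one distinct value the sample figures are larger than the population figures, and the gap shrinks as the batch grows.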

This example computes the minimum, maximum, and average of column y for a batch of data.

.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.stats.describe[`y; `minimum`maximum`average]
 .qsp.write.toVariable[`output];
publish ([] x: til 5; y: 10 13 1 9 8)
output
Expected output: ([] minimum_y: enlist 1; maximum_y: enlist 13; average_y: enlist 8.2)
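As a rough illustration of the result shape, the following plain-Python sketch computes the same three statistics for the same batch and names each result <statistic>_<column>, matching the column names in the expected output. The describe helper here is hypothetical and stands in for the operator; it is not part of any Stream Processor API.

```python
# The batch published in the example, as a dictionary of columns.
batch = {"x": [0, 1, 2, 3, 4], "y": [10, 13, 1, 9, 8]}


def describe(batch, field, stats):
    """Hypothetical helper: return one summary row keyed as <statistic>_<column>."""
    column = batch[field]
    functions = {
        "minimum": min,
        "maximum": max,
        "average": lambda values: sum(values) / len(values),
    }
    return {f"{name}_{field}": functions[name](column) for name in stats}


summary = describe(batch, "y", ["minimum", "maximum", "average"])
print(summary)  # {'minimum_y': 1, 'maximum_y': 13, 'average_y': 8.2}
```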

This example demonstrates how to use the percentiles option. The operator below computes the mode and skew along with the 90th, 95th and 99th percentiles.

Enlist for percentiles

If only percentiles are to be computed, the tuple must be enlisted.

.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.stats.describe[`x; (`mode; `skew; (`percentiles; 0.9 0.95 0.99))]
 .qsp.write.toVariable[`output];
publish ([] x: til 100)
output

sp.stats.describe('price', 'average')

Parameters:

| name | type | description | default |
|---|---|---|---|
| fields | symbol or symbol[] | A list of column names on which to compute the statistics | Required |
| stats | symbol, symbol[], or list of tuples and symbols | A list of statistics that should be computed | Required |

Returns: A pipeline comprised of a describe operator, which can be joined to other pipelines.

A list of all supported statistic options can be found below:

| name | type | description |
|---|---|---|
| minimum | string | Computes the minimum of each provided column |
| maximum | string | Computes the maximum of each provided column |
| range | string | Computes the range of each provided column |
| length | string | Counts the length of the batch provided |
| total | string | Computes the total sum of each provided column |
| average | string | Computes the average of each provided column |
| numDistinct | string | Counts the number of distinct elements in each provided column |
| numNull | string | Counts the number of null elements in each provided column |
| numInfinity | string | Counts the number of infinite elements in each provided column |
| median | string | Computes the median of each provided column |
| quartiles | string | Computes the quartiles of each provided column |
| frequency | string | Creates a frequency dictionary for each provided column |
| mode | string | Computes all modes of each provided column |
| sampleVar | string | Computes the sample variance of each provided column |
| sampleStd | string | Computes the sample standard deviation of each provided column |
| populationVar | string | Computes the population variance of each provided column |
| populationStd | string | Computes the population standard deviation of each provided column |
| standardError | string | Computes the standard error of each provided column |
| skew* | string | Computes the skewness of each provided column |
| percentiles | tuple | Computes the specified percentiles on each provided column |

*calculated using the Fisher-Pearson coefficient of skewness

Categorical Data

Some statistics do not support categorical data and return a generic null for such columns.

>>> from kxi import sp
>>> import pykx as kx
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.describe('x', 'average')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': [5, 1, 4, 2, 3],
...     'y': [100, 100, 200, 50, 50]
... })
>>> kx.q('publish', data)
average_x
---------
3

Using percentiles along with other stats

>>> from kxi import sp
>>> import pykx as kx
>>> sp.run(sp.read.from_expr('([] x: 1 2 2 3 3 3 4 4 4 4)')
...     | sp.stats.describe('x', ['mode', 'skew', ('percentiles', [0.9, 0.95, 0.99])])
...     | sp.write.to_variable('out'))
>>> kx.q('out')
mode_x skew_x    percentile_0.9_x percentile_0.95_x percentile_0.99_x
---------------------------------------------------------------------
4      -0.512289 4                4                 4
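The skew_x value above can be reproduced in plain Python. The sketch below divides the third central moment (averaged over n) by the cube of the sample standard deviation (n - 1 divisor). That normalization is an assumption inferred from the sample output rather than a documented formula, so treat it as a way to sanity-check results, not as the operator's definition.

```python
import math


def skew(values):
    """Skewness sketch: third central moment over the cubed sample std."""
    n = len(values)
    mean = sum(values) / n
    # Third central moment, averaged over all n observations.
    third_moment = sum((v - mean) ** 3 for v in values) / n
    # Sample standard deviation (n - 1 divisor). Assumption: this divisor
    # is inferred from the sample output, not taken from documentation.
    sample_std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return third_moment / sample_std ** 3


x = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
result = skew(x)
print(round(result, 6))  # -0.512289
```

The negative sign reflects the longer tail towards the smaller values in x.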

Exponential Moving Average

Calculates the exponential moving average.

.qsp.stats.ema[X; alpha; y]

Parameters:

| name | type | description | default |
|---|---|---|---|
| X | symbol or symbol[] | A list of column names on which to compute the average | Required |
| alpha | float | The decay rate | Required |
| y | symbol or symbol[] | The columns to write to. These can overwrite existing columns | The same as X |

For all common arguments, refer to configuring operators

This example replaces the columns x and y with their exponential moving averages.

.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.stats.ema[`x`y; .33]
 .qsp.write.toConsole[];
publish ([] x: til 10; y: 0 1 4 2 5 3 6 7 9 8)

sp.stats.ema('volume', 0.33, 'res')

Parameters:

| name | type | description | default |
|---|---|---|---|
| X | symbol or symbol[] | A single column name or list of column names on which to compute the statistics | Required |
| alpha | float | The decay rate to use | Required |
| y | symbol or symbol[] | A single column name or list of column names to output results to | The same as X |

Number of input/output columns

The number of source and destination columns must match

Returns: A pipeline comprised of an ema operator, which can be joined to other pipelines.

>>> from kxi import sp
>>> import pandas as pd
>>> import pykx as kx
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.ema('x', 0.33, 'res')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': [1, 50, 3, 4, 5, 6]
... })
>>> kx.q('publish', data)
x  res
-----------
1  1
50 17.17
3  12.4939
4  9.690913
5  8.142912
6  7.435751
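The res column above is consistent with the usual recurrence: the first average equals the first value, and each later average is alpha times the new value plus (1 - alpha) times the previous average. The following plain-Python sketch applies that recurrence to the same data. It is an illustration for checking results by hand, not the operator's implementation.

```python
def ema(values, alpha):
    """Exponential moving average seeded with the first value."""
    averages = []
    for value in values:
        if not averages:
            # The first record has no history, so it is its own average.
            averages.append(float(value))
        else:
            averages.append(alpha * value + (1 - alpha) * averages[-1])
    return averages


res = ema([1, 50, 3, 4, 5, 6], 0.33)
print([round(r, 6) for r in res])
```

A larger alpha gives more weight to recent records, so the average recovers from the outlier 50 more quickly.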

Simple Moving Average

Computes a moving average by record count.

.qsp.stats.sma[X; n; y]

Parameters:

| name | type | description | default |
|---|---|---|---|
| X | symbol or symbol[] | A list of column names on which to compute the average | Required |
| n | long | The number of records to include in the average | Required |
| y | symbol or symbol[] | The columns to write to. These can overwrite existing columns | The same as X |

For all common arguments, refer to configuring operators

This calculates, for each data point, the arithmetic mean of a moving window including that point and the n-1 prior data points.

This example replaces each value in y with the simple moving average of that value and the nine prior values.

.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.stats.sma[`y; 10]
 .qsp.write.toConsole[];
publish ([] x: til 10; y: 0 1 4 2 5 3 6 7 9 8)

sp.stats.sma('price', 60, 'movingAvgPrice')

Parameters:

| name | type | description | default |
|---|---|---|---|
| X | symbol or symbol[] | A single column name or list of column names on which to compute the statistics | Required |
| window | long | The size of the window which should be used to calculate the average | Required |
| y | symbol or symbol[] | A single column name or list of column names to output results to | The same as X |

Number of input/output columns

The number of source and destination columns must match

Returns: A pipeline comprised of an sma operator, which can be joined to other pipelines.

>>> from kxi import sp
>>> import pandas as pd
>>> import pykx as kx
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.sma('x', 3, 'res')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': [1, 50, 3, 4, 5, 6]
... })
>>> kx.q('publish', data)
x  res
-------
1  1
50 25.5
3  18
4  19
5  4
6  5
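The res column above can be checked by hand with a short plain-Python sketch: for each record, take the mean of that record and up to window - 1 prior records. Averaging over only the records available at the start of the data is what the first two output rows suggest. The helper is illustrative and not part of the Stream Processor API.

```python
def sma(values, window):
    """Mean of each value and up to window - 1 values before it."""
    averages = []
    for i in range(len(values)):
        # Early records have fewer than `window` values available.
        recent = values[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


res = sma([1, 50, 3, 4, 5, 6], 3)
print(res)  # [1.0, 25.5, 18.0, 19.0, 4.0, 5.0]
```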

Time Weighted Average

Computes a running time-weighted average.

.qsp.stats.twa[X; times; range; y]

Parameters:

| name | type | description | default |
|---|---|---|---|
| X | symbol or symbol[] | A list of column names on which to compute the average | Required |
| times | symbol | The name of the column containing the time data | Required |
| range | long, int or short | The number of records to include in the average | Required |
| y | symbol or symbol[] | The columns to write to. These can overwrite existing columns | Same as X |

For all common arguments, refer to configuring operators

This calculates, for each data point, the mean of a moving window containing that point and the range-1 prior data points, weighted by the time deltas found in times.

Data must be sorted

The incoming data must be sorted by time, because the average is calculated using the deltas between consecutive timestamps. Out-of-order data would cause negative weights to be applied to the calculation.

This example replaces each value in y with the time weighted average of that value and the nine prior values using weights derived from the time column.

.qsp.run
 .qsp.read.fromCallback[`publish]
 // The windowing is to ensure that records are sorted by timestamp
 .qsp.window.tumbling[00:01:00; `time; .qsp.use `sort`lateness!(1b; 00:00:10)]
 .qsp.stats.twa[`data; `time; 10]
 .qsp.write.toConsole[];
publish ([] time: 0p + 00:00:01 * 0 5 6 17 14 21 57 58 71;
 data: 10 20 10 9 11 8 21 10 9)

This example writes to columns c and d the time weighted averages of columns a and b respectively, each computed over the current value and the four prior values, using the time column to derive the weights.

.qsp.run
 .qsp.read.fromCallback[`publish]
 .qsp.window.tumbling[00:00:01; `time; .qsp.use `sort`lateness!(1b; 00:00:01)]
 .qsp.stats.twa[`a`b; `time; 5; `c`d]
 .qsp.write.toConsole[];
publish ([] time: 0p + 00:00:00.1 * 0 8 13 17 19 21; a: 1 7 8 7 7 8; b: til 6);

sp.stats.twa('price', 'time', 60, '1minMovingAvgPrice')

Parameters:

| name | type | description | default |
|---|---|---|---|
| X | symbol or symbol[] | A single column name or list of column names on which to compute the statistics | Required |
| times | symbol or symbol[] | A list of times to be used for weighting | Required |
| window | timespan | The size of the window which should be used to calculate the average | Required |
| y | symbol or symbol[] | A single column name or list of column names to output results to | Same as X |

Number of input/output columns

The number of source and destination columns must match

This calculates, for each data point, the mean of a moving window containing that point and the preceding data points in the window, weighted by the time deltas found in times.

Data must be sorted

The incoming data must be sorted by time, because the average is calculated using the deltas between consecutive timestamps. Out-of-order data would cause negative weights to be applied to the calculation.

Returns: A pipeline comprised of a twa operator, which can be joined to other pipelines.

Examples:

>>> from kxi import sp
>>> from datetime import timedelta
>>> import pandas as pd
>>> import pykx as kx
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.twa('x', 'time', 3, 'res')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': range(1, 6),
...     'time': [timedelta(seconds=x) for x in [0, 5, 6, 14, 17]]
... })
>>> kx.q('publish', data)
x  time                 res
--------------------------------
1  0D00:00:00.000000000 1
2  0D00:00:05.000000000 2
3  0D00:00:06.000000000 2.166667
4  0D00:00:14.000000000 3.214286
5  0D00:00:17.000000000 4.166667
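The res column above is consistent with the following reading, sketched in plain Python. Each value is weighted by the time elapsed since the previous record, the first record gets a weight of zero, and the window covers the current record plus the two before it. Both the weighting scheme and the fallback used when all weights are zero are assumptions inferred from this sample output, not a documented algorithm.

```python
def twa(values, times, window):
    """Time-weighted average over the current and window - 1 prior records."""
    # Weight of each record: seconds elapsed since the previous record.
    # Assumption: the first record has no predecessor, so its weight is 0.
    deltas = [0.0] + [later - earlier for earlier, later in zip(times, times[1:])]
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        window_values = values[start:i + 1]
        weights = deltas[start:i + 1]
        total = sum(weights)
        if total == 0:
            # Assumption: with no elapsed time, fall back to the value itself.
            averages.append(float(values[i]))
        else:
            weighted = sum(v * w for v, w in zip(window_values, weights))
            averages.append(weighted / total)
    return averages


res = twa([1, 2, 3, 4, 5], [0, 5, 6, 14, 17], 3)
print([round(r, 6) for r in res])
```

For the fourth record, for example, the window holds the values 2, 3 and 4 with weights 5, 1 and 8 seconds, giving 45 / 14, which rounds to 3.214286.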