Database Creation and Management

This notebook provides a walkthrough of some of the functionality available for users looking to create and maintain large databases using PyKX.

This notebook refers to creating and maintaining large partitioned kdb+ databases using PyKX. Go to Q for Mortals for more in-depth information about partitioned databases in kdb+.

You can download this walkthrough as a .ipynb notebook file.

This walkthrough provides examples of the following tasks:

  1. Creating a database from a historical dataset

  2. Adding a new partition to the database

  3. Managing the on-disk database by:

    • Renaming a table and column

    • Creating a copy of a column to the database

    • Applying a Python function to a column of the database

    • Updating the data type of a column

  4. Adding a new table to the most recent partition of the database

For full information on the functions available, go to the API section.

Initial setup

Import all required libraries and create a temporary directory which will be used to store the database we create for this walkthrough.

In [2]:

Python

Copy
import os
os.environ['PYKX_BETA_FEATURES'] = 'true'

import pykx as kx
from datetime import date
import tempfile
In [3]:
Copy
tempdir = tempfile.TemporaryDirectory()

Database interactions are facilitated through use of the pykx.DB class. All methods/attributes used in this notebook are contained within this class.

Initialise the DB class to start. The expected input is the file path where you intend to save the partitioned database and its associated tables. In this case we're going to use the temporary directory we just created.

In [4]:

Python

Copy
db = kx.DB(path = tempdir.name + '/db')

For details on any methods contained within this class, use the help method.

In [5]:

Python

Copy
help(db.create)

Help on method create in module pykx.db:

Copy
create(table, table_name, partition, *, by_field=None, sym_enum=None, log=True) method of pykx.db.DB instance

Create an on-disk partitioned table within a kdb+ database from a supplied pykx.Table object. Once generated this table will be accessible as an attribute of the DB class or a sub attribute of DB.table.

Parameters:

  • table: The pykx.Table object which is to be persisted to disk

  • partition: The name of the column which is to be used to partition the data if

  • table_name: The name with which the table will be persisted and accessible once loaded and available as a pykx.PartitionedTablesupplied as a str or if supplied as non string object this will be used as the partition to which all data is persisted

  • by_field: A field of the table to be used as a by column, this column will be the second column in the table (the first being the virtual column determined by the partitioning column)

  • sym_enum: The name of the symbol enumeration table to be associated with the table

  • log: Print information about status of partitioned datab

Returns: a None object on successful invocation, the database class will be updated to contain attributes associated with the available created table

Examples:

Generate a partitioned table from a table containing multiple partitions

Python

Copy
    >>> import pykx as kx
    >>> db = kx.DB(path = 'newDB')
    >>> N = 1000
    >>> qtab = kx.Table(data = {
    ...     'date': kx.q.asc(kx.random.random(N, kx.q('2020.01 2020.02 2020.03'))),
    ...     'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
    ...     'price': kx.random.random(N, 10.0),
    ...     'size': kx.random.random(N, 100)
    ... })
    >>> db.create(qtab, 'stocks', 'date', by_field = 'sym', sym_enum = 'symbols')
    >>> db.tables
    ['stocks']
    >>> db.stocks
    pykx.PartitionedTable(pykx.q('
    month   sym  price     size
    ---------------------------
    2020.01 AAPL 7.979004  85
    2020.01 AAPL 5.931866  55
    2020.01 AAPL 5.255477  49
    2020.01 AAPL 8.15255   74
    2020.01 AAPL 4.771067  80
    ..
    '))

Add a table as a partition to an on-disk database, in the example below we are adding a partition to the table generated above

Python

Copy
    >>> import pykx as kx
    >>> db = kx.DB(path = 'newDB')
    >>> N = 333
    >>> qtab = kx.Table(data = {
    ...     'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
    ...     'price': kx.random.random(N, 10.0),
    ...     'size': kx.random.random(N, 100)
    ... })
    >>> compress = kx.Compress(kx.CompressionAlgorithm.gzip, level=2)
    >>> db.create(qtab, 'stocks', kx.q('2020.04m'), compress=compress)
    >>> kx.q('{-21!hsym x}', '/tmp/newDB/2020.04/stocks/price')
    pykx.Dictionary(pykx.q('
    compressedLength  | 2064
    uncompressedLength| 2680
    algorithm         | 2i
    logicalBlockSize  | 17i
    zipLevel          | 2i
    '))
    ```

Create the sample dataset

Create a dataset called trades containing time-series data spanning multiple dates, and columns of various types:

In [6]:

Python

Copy
N = 1000000
trades = kx.Table(data={
     'date': kx.random.random(N, [date(2020, 1, 1), date(2020, 1, 2)]),
     'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
     'price': kx.random.random(N, 10.0),
     'size': kx.random.random(N, 1000)
})

Create the database

Create the database using the date column as the partition, and add trades as a table called trade_data within it.

In [7]:

Python

Copy
db.create(trades, 'trade_data', 'date')
 

Writing Database Partition 2020.01.01 to table trade_data

Writing Database Partition 2020.01.02 to table trade_data

This now exists as a table and is saved to disk.

In [8]:

Python

Copy
db.tables
Out [8]:

['trade_data']

When a table is saved, an attribute is added to the db class for it. For our newly generated table, this is db.trade_data.

In [9]:

Python

Copy
db.trade_data
Out [9]:
  date sym price size
0 2020.01.01 MSFT 7.079266 800
1 2020.01.01 AAPL 1.824321 65
2 2020.01.01 MSFT 2.408259 292
3 2020.01.01 GOOG 1.675438 7
4 2020.01.01 AAPL 8.311168 183
5 2020.01.01 AAPL 2.208693 989
6 2020.01.01 MSFT 6.068126 567
7 2020.01.01 AAPL 4.918926 794
8 2020.01.01 AAPL 9.331869 39
9 2020.01.01 AAPL 1.142611 507
10 2020.01.01 AAPL 2.685874 581
11 2020.01.01 AAPL 3.483591 163
12 2020.01.01 AAPL 0.4422525 466
13 2020.01.01 MSFT 7.406654 976
14 2020.01.01 MSFT 2.493871 171
15 2020.01.01 AAPL 9.242088 28
16 2020.01.01 MSFT 0.3954522 747
17 2020.01.01 MSFT 0.3441191 512
18 2020.01.01 GOOG 9.662762 998
19 2020.01.01 AAPL 9.601674 812
20 2020.01.01 AAPL 4.969858 910
21 2020.01.01 GOOG 1.048204 830
22 2020.01.01 GOOG 0.9817644 595
... ... ... ... ...
999999 2020.01.02 GOOG 1.470716 636

1,000,000 rows × 4 columns

Add a new partition to the database

Once a table has been generated, you can add more partitions to the database through reuse of the create method. In this case we are adding the new partition 2020.01.03 to the database.

In [10]:

Python

Copy
N = 10000
new_day = kx.Table(data={
    'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
    'price': kx.random.random(N, 10.0),
    'size': kx.random.random(N, 100)
})
db.create(new_day, 'trade_data', date(2020, 1, 3))
 

Writing Database Partition 2020-01-03 to table trade_data

Manage the database

This section covers updating the contents of a database. The examples below demonstrate a number of common tasks that would be completed regularly when updating a database.

The name of a table can be updated using the rename_table method. Below, we are updating the table trade_data to be called trade.

In [11]:

Python

Copy
db.rename_table('trade_data', 'trades')
 

2025.02.14 12:03:53 renaming :/tmp/tmp4594s3s9/db/2020.01.01/trade_data to :/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:53 renaming :/tmp/tmp4594s3s9/db/2020.01.02/trade_data to :/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:53 renaming :/tmp/tmp4594s3s9/db/2020.01.03/trade_data to :/tmp/tmp4594s3s9/db/2020.01.03/trades

During the rename process, the attribute in the db class is also updated.

In [12]:

Python

Copy
db.trades
Out [12]:
  date sym price size
0 2020.01.01 MSFT 7.079266 800
1 2020.01.01 AAPL 1.824321 65
2 2020.01.01 MSFT 2.408259 292
3 2020.01.01 GOOG 1.675438 7
4 2020.01.01 AAPL 8.311168 183
5 2020.01.01 AAPL 2.208693 989
6 2020.01.01 MSFT 6.068126 567
7 2020.01.01 AAPL 4.918926 794
8 2020.01.01 AAPL 9.331869 39
9 2020.01.01 AAPL 1.142611 507
10 2020.01.01 AAPL 2.685874 581
11 2020.01.01 AAPL 3.483591 163
12 2020.01.01 AAPL 0.4422525 466
13 2020.01.01 MSFT 7.406654 976
14 2020.01.01 MSFT 2.493871 171
15 2020.01.01 AAPL 9.242088 28
16 2020.01.01 MSFT 0.3954522 747
17 2020.01.01 MSFT 0.3441191 512
18 2020.01.01 GOOG 9.662762 998
19 2020.01.01 AAPL 9.601674 812
20 2020.01.01 AAPL 4.969858 910
21 2020.01.01 GOOG 1.048204 830
22 2020.01.01 GOOG 0.9817644 595
... ... ... ... ...
1009999 2020.01.03 AAPL 9.750387 99

1,010,000 rows × 4 columns

To rename a column in a table, use the rename_column method. For example, let's rename the sym column (in the trade table) to ticker.

In [13]:

Python

Copy
db.rename_column('trades', 'sym', 'ticker')
 

2025.02.14 12:03:54 renaming sym to ticker in `:/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:54 renaming sym to ticker in `:/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:54 renaming sym to ticker in `:/tmp/tmp4594s3s9/db/2020.01.03/trades

In [14]:

Python

Copy
db.trades
Out [14]:
  date ticker price size
0 2020.01.01 MSFT 7.079266 800
1 2020.01.01 AAPL 1.824321 65
2 2020.01.01 MSFT 2.408259 292
3 2020.01.01 GOOG 1.675438 7
4 2020.01.01 AAPL 8.311168 183
5 2020.01.01 AAPL 2.208693 989
6 2020.01.01 MSFT 6.068126 567
7 2020.01.01 AAPL 4.918926 794
8 2020.01.01 AAPL 9.331869 39
9 2020.01.01 AAPL 1.142611 507
10 2020.01.01 AAPL 2.685874 581
11 2020.01.01 AAPL 3.483591 163
12 2020.01.01 AAPL 0.4422525 466
13 2020.01.01 MSFT 7.406654 976
14 2020.01.01 MSFT 2.493871 171
15 2020.01.01 AAPL 9.242088 28
16 2020.01.01 MSFT 0.3954522 747
17 2020.01.01 MSFT 0.3441191 512
18 2020.01.01 GOOG 9.662762 998
19 2020.01.01 AAPL 9.601674 812
20 2020.01.01 AAPL 4.969858 910
21 2020.01.01 GOOG 1.048204 830
22 2020.01.01 GOOG 0.9817644 595
... ... ... ... ...
1009999 2020.01.03 AAPL 9.750387 99

1,010,000 rows × 4 columns

To safely apply a function to modify the price column within the database, first create a copy of the column.

In [15]:

Python

Copy
db.copy_column('trades', 'price', 'price_copy')
Out [15]:

2025.02.14 12:03:54 copying price to price_copy in `:/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:54 copying price to price_copy in `:/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:54 copying price to price_copy in `:/tmp/tmp4594s3s9/db/2020.01.03/trades

In [16]:

Python

Copy
db.trades
Out [16]:
  date ticker price size price_copy
0 2020.01.01 MSFT 7.079266 800 7.079266
1 2020.01.01 AAPL 1.824321 65 1.824321
2 2020.01.01 MSFT 2.408259 292 2.408259
3 2020.01.01 GOOG 1.675438 7 1.675438
4 2020.01.01 AAPL 8.311168 183 8.311168
5 2020.01.01 AAPL 2.208693 989 2.208693
6 2020.01.01 MSFT 6.068126 567 6.068126
7 2020.01.01 AAPL 4.918926 794 4.918926
8 2020.01.01 AAPL 9.331869 39 9.331869
9 2020.01.01 AAPL 1.142611 507 1.142611
10 2020.01.01 AAPL 2.685874 581 2.685874
11 2020.01.01 AAPL 3.483591 163 3.483591
12 2020.01.01 AAPL 0.4422525 466 0.4422525
13 2020.01.01 MSFT 7.406654 976 7.406654
14 2020.01.01 MSFT 2.493871 171 2.493871
15 2020.01.01 AAPL 9.242088 28 9.242088
16 2020.01.01 MSFT 0.3954522 747 0.3954522
17 2020.01.01 MSFT 0.3441191 512 0.3441191
18 2020.01.01 GOOG 9.662762 998 9.662762
19 2020.01.01 AAPL 9.601674 812 9.601674
20 2020.01.01 AAPL 4.969858 910 4.969858
21 2020.01.01 GOOG 1.048204 830 1.048204
22 2020.01.01 GOOG 0.9817644 595 0.9817644
... ... ... ... ... ...
1009999 2020.01.03 AAPL 9.750387 99 9.750387

1,010,000 rows × 4 columns

You can now apply a function to the copied column without the risk of losing the original data. Below, let's modify the copied column by multiplying the contents by 2.

In [17]:

Python

Copy
db.apply_function('trades', 'price_copy', kx.q('{2*x}'))
 

2025.02.14 12:03:54 resaving column price_copy (type 9) in `:/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:54 resaving column price_copy (type 9) in `:/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:54 resaving column price_copy (type 9) in `:/tmp/tmp4594s3s9/db/2020.01.03/trades

In [18]:

Python

Copy
db.trades
Out [18]:
  date ticker price size price_copy
0 2020.01.01 MSFT 7.079266 800 14.15853
1 2020.01.01 AAPL 1.824321 65 3.648642
2 2020.01.01 MSFT 2.408259 292 4.816519
3 2020.01.01 GOOG 1.675438 7 3.350875
4 2020.01.01 AAPL 8.311168 183 16.62234
5 2020.01.01 AAPL 2.208693 989 4.417385
6 2020.01.01 MSFT 6.068126 567 12.13625
7 2020.01.01 AAPL 4.918926 794 9.837851
8 2020.01.01 AAPL 9.331869 39 18.66374
9 2020.01.01 AAPL 1.142611 507 2.285222
10 2020.01.01 AAPL 2.685874 581 5.371748
11 2020.01.01 AAPL 3.483591 163 6.967183
12 2020.01.01 AAPL 0.4422525 466 0.8845049
13 2020.01.01 MSFT 7.406654 976 14.81331
14 2020.01.01 MSFT 2.493871 171 4.987742
15 2020.01.01 AAPL 9.242088 28 18.48418
16 2020.01.01 MSFT 0.3954522 747 0.7909045
17 2020.01.01 MSFT 0.3441191 512 0.6882382
18 2020.01.01 GOOG 9.662762 998 19.32552
19 2020.01.01 AAPL 9.601674 812 19.20335
20 2020.01.01 AAPL 4.969858 910 9.939716
21 2020.01.01 GOOG 1.048204 830 2.096408
22 2020.01.01 GOOG 0.9817644 595 1.963529
... ... ... ... ... ...
1009999 2020.01.03 AAPL 9.750387 99 19.50077

1,010,000 rows × 4 columns

Once you are happy with the new values within the price_copy column, you can safely delete the price column, then rename the price_copy column to be called price.

In [19]:

Python

Copy
db.delete_column('trades', 'price')
db.rename_column('trades', 'price_copy', 'price')
 

2025.02.14 12:03:54 deleting column price from `:/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:54 deleting column price from `:/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:54 deleting column price from `:/tmp/tmp4594s3s9/db/2020.01.03/trades

2025.02.14 12:03:54 renaming price_copy to price in `:/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:54 renaming price_copy to price in `:/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:54 renaming price_copy to price in `:/tmp/tmp4594s3s9/db/2020.01.03/trades

In [20]:

Python

Copy
db.trades
Out [20]:
  date ticker size price
0 2020.01.01 MSFT 800 14.15853
1 2020.01.01 AAPL 65 3.648642
2 2020.01.01 MSFT 292 4.816519
3 2020.01.01 GOOG 7 3.350875
4 2020.01.01 AAPL 183 16.62234
5 2020.01.01 AAPL 989 4.417385
6 2020.01.01 MSFT 567 12.13625
7 2020.01.01 AAPL 794 9.837851
8 2020.01.01 AAPL 39 18.66374
9 2020.01.01 AAPL 507 2.285222
10 2020.01.01 AAPL 581 5.371748
11 2020.01.01 AAPL 163 6.967183
12 2020.01.01 AAPL 466 0.8845049
13 2020.01.01 MSFT 976 14.81331
14 2020.01.01 MSFT 171 4.987742
15 2020.01.01 AAPL 28 18.48418
16 2020.01.01 MSFT 747 0.7909045
17 2020.01.01 MSFT 512 0.6882382
18 2020.01.01 GOOG 998 19.32552
19 2020.01.01 AAPL 812 19.20335
20 2020.01.01 AAPL 910 9.939716
21 2020.01.01 GOOG 830 2.096408
22 2020.01.01 GOOG 595 1.963529
... ... ... ... ...
1009999 2020.01.03 AAPL 99 19.50077

1,010,000 rows × 4 columns

To convert the data type of a column, use the set_column_type method. Before we do that, let's look at the metadata information for the table using the meta method:

In [21]:

Python

Copy
kx.q.meta(db.trades)
Out [21]:
  t f a
c      
date "d"    
ticker "s"    
size "j"    
price "f"    

Currently the size column is the type LongAtom. Let's update this to be a type ShortAtom:

In [22]:

Python

Copy
db.set_column_type('trades', 'size', kx.ShortAtom)
Out [22]:

2025.02.14 12:03:54 resaving column size (type 5) in `:/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:54 resaving column size (type 5) in `:/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:54 resaving column size (type 5) in `:/tmp/tmp4594s3s9/db/2020.01.03/trades

Now let's apply the grouped attribute to the size column. For more information on attributes in kdb+, refer to the Q for Mortals Attributes section.

In [23]:

Python

Copy
db.set_column_attribute('trades', 'ticker', 'grouped')
 

2025.02.14 12:03:54 resaving column ticker (type 20) in `:/tmp/tmp4594s3s9/db/2020.01.01/trades

2025.02.14 12:03:54 resaving column ticker (type 20) in `:/tmp/tmp4594s3s9/db/2020.01.02/trades

2025.02.14 12:03:54 resaving column ticker (type 20) in `:/tmp/tmp4594s3s9/db/2020.01.03/trades

Let's revisit the metadata of the table to ensure they have been applied correctly.

In [4]:

Python

Copy
kx.q.meta(db.trades)
Out [4]:
  t f a
c      
date "d"    
ticker "s"   g
size "h"    
price "f"    

Onboard your next table

Now that you have successfully set up one table, you may want to add a second table. We follow the same method as before and create the quotes table using the create method. In this example, the quotes table only contains data for 2020.01.03:

In [25]:

Python

Copy
quotes = kx.Table(data={
    'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
    'open': kx.random.random(N, 10.0),
    'high': kx.random.random(N, 10.0),
    'low': kx.random.random(N, 10.0),
    'close': kx.random.random(N, 10.0)
})
In [26]:

Python

Copy
db.create(quotes, 'quotes', date(2020, 1, 3), by_field = 'sym')
 

Writing Database Partition 2020-01-03 to table quotes

All tables within a database must contain the same partition structure. To ensure you can access the new table, the quotes table needs to exist in every partition within the database, even if there is no data for that partition. This is called backfilling data. For the partitions where the quotes table is missing, we use the fill_database method:

In [27]:

Python

Copy
db.fill_database()
 

Successfully filled missing tables to partition: :/tmp/tmp4594s3s9/db/2020.01.02

Successfully filled missing tables to partition: :/tmp/tmp4594s3s9/db/2020.01.01

Now that the database has resolved the missing tables within the partitions, we can view the new quotes table:

In [28]:

Python

Copy
db.quotes
Out [28]:
  date sym price size
0 2020.01.01 MSFT 7.079266 800
1 2020.01.01 AAPL 1.824321 65
2 2020.01.01 MSFT 2.408259 292
3 2020.01.01 GOOG 1.675438 7
4 2020.01.01 AAPL 8.311168 183
5 2020.01.01 AAPL 2.208693 989
6 2020.01.01 MSFT 6.068126 567
7 2020.01.01 AAPL 4.918926 794
8 2020.01.01 AAPL 9.331869 39
9 2020.01.01 AAPL 1.142611 507
10 2020.01.01 AAPL 2.685874 581
11 2020.01.01 AAPL 3.483591 163
12 2020.01.01 AAPL 0.4422525 466
13 2020.01.01 MSFT 7.406654 976
14 2020.01.01 MSFT 2.493871 171
15 2020.01.01 AAPL 9.242088 28
16 2020.01.01 MSFT 0.3954522 747
17 2020.01.01 MSFT 0.3441191 512
18 2020.01.01 GOOG 9.662762 998
19 2020.01.01 AAPL 9.601674 812
20 2020.01.01 AAPL 4.969858 910
21 2020.01.01 GOOG 1.048204 830
22 2020.01.01 GOOG 0.9817644 595
... ... ... ... ...
999999 2020.01.02 GOOG 1.470716 636

10,000 rows × 6 columns

Finally, to view the amount of saved data, count the number of rows per partition using partition_count:

In [29]:

Python

Copy
db.partition_count()
Out [29]:
  quotes trades
2020.01.01 0 500425
2020.01.02 0 499575
2020.01.03 10000 10000

Clean up temporary database created

In [30]:

Python

Copy
tempdir.cleanup()