Database Creation and Management
This notebook provides a walkthrough of some of the functionality available for users looking to create and maintain large databases using PyKX.
This notebook refers to creating and maintaining large partitioned kdb+ databases using PyKX. Go to Q for Mortals for more in-depth information about partitioned databases in kdb+.
You can download this walkthrough as a .ipynb
notebook file.
This walkthrough provides examples of the following tasks:
-
Creating a database from a historical dataset
-
Adding a new partition to the database
-
Managing the on-disk database by:
-
Renaming a table and column
-
Creating a copy of a column to the database
-
Applying a Python function to a column of the database
-
Updating the data type of a column
-
-
Adding a new table to the most recent partition of the database
For full information on the functions available, go to the API section.
Initial setup
Import all required libraries and create a temporary directory which will be used to store the database we create for this walkthrough.
In [2]:
|
Python Copy
|
In [3]:
|
Copy
|
Database interactions are facilitated through use of the pykx.DB
class. All methods/attributes used in this notebook are contained within this class.
Initialise the DB
class to start. The expected input is the file path where you intend to save the partitioned database and its associated tables. In this case we're going to use the temporary directory we just created.
In [4]:
|
Python Copy
|
For details on any methods contained within this class, use the help
method.
In [5]:
|
Python Copy
|
Help on method create in module pykx.db
:
create(table, table_name, partition, *, by_field=None, sym_enum=None, log=True) method of pykx.db.DB instance
Create an on-disk partitioned table within a kdb+ database from a supplied pykx.Table
object. Once generated this table will be accessible as an attribute of the DB
class or a sub attribute of DB.table
.
Parameters:
-
table: The
pykx.Table
object which is to be persisted to disk -
partition: The name of the column which is to be used to partition the data if
-
table_name: The name with which the table will be persisted and accessible once loaded and available as a
pykx.PartitionedTable
supplied as astr
or if supplied as non string object this will be used as the partition to which all data is persisted -
by_field: A field of the table to be used as a by column, this column will be the second column in the table (the first being the virtual column determined by the partitioning column)
-
sym_enum: The name of the symbol enumeration table to be associated with the table
-
log: Print information about status of partitioned datab
Returns: a None
object on successful invocation, the database class will be updated to contain attributes associated with the available created table
Examples:
Generate a partitioned table from a table containing multiple partitions
Python
>>> import pykx as kx
>>> db = kx.DB(path = 'newDB')
>>> N = 1000
>>> qtab = kx.Table(data = {
... 'date': kx.q.asc(kx.random.random(N, kx.q('2020.01 2020.02 2020.03'))),
... 'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
... 'price': kx.random.random(N, 10.0),
... 'size': kx.random.random(N, 100)
... })
>>> db.create(qtab, 'stocks', 'date', by_field = 'sym', sym_enum = 'symbols')
>>> db.tables
['stocks']
>>> db.stocks
pykx.PartitionedTable(pykx.q('
month sym price size
---------------------------
2020.01 AAPL 7.979004 85
2020.01 AAPL 5.931866 55
2020.01 AAPL 5.255477 49
2020.01 AAPL 8.15255 74
2020.01 AAPL 4.771067 80
..
'))
Add a table as a partition to an on-disk database, in the example below we are adding a partition to the table generated above
Python
>>> import pykx as kx
>>> db = kx.DB(path = 'newDB')
>>> N = 333
>>> qtab = kx.Table(data = {
... 'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
... 'price': kx.random.random(N, 10.0),
... 'size': kx.random.random(N, 100)
... })
>>> compress = kx.Compress(kx.CompressionAlgorithm.gzip, level=2)
>>> db.create(qtab, 'stocks', kx.q('2020.04m'), compress=compress)
>>> kx.q('{-21!hsym x}', '/tmp/newDB/2020.04/stocks/price')
pykx.Dictionary(pykx.q('
compressedLength | 2064
uncompressedLength| 2680
algorithm | 2i
logicalBlockSize | 17i
zipLevel | 2i
'))
```
Create the sample dataset
Create a dataset called trades
containing time-series data spanning multiple dates, and columns of various types:
In [6]:
|
Python Copy
|
Create the database
Create the database using the date
column as the partition, and add trades
as a table called trade_data
within it.
In [7]:
|
Python Copy
|
|
This now exists as a table and is saved to disk.
In [8]:
|
Python Copy
|
Out [8]:
|
|
When a table is saved, an attribute is added to the db
class for it. For our newly generated table, this is db.trade_data
.
In [9]:
|
Python Copy
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Out [9]:
|
1,000,000 rows × 4 columns |
Add a new partition to the database
Once a table has been generated, you can add more partitions to the database through reuse of the create
method. In this case we are adding the new partition 2020.01.03
to the database.
In [10]:
|
Python Copy
|
|
Manage the database
This section covers updating the contents of a database. The examples below demonstrate a number of common tasks that would be completed regularly when updating a database.
The name of a table can be updated using the rename_table
method. Below, we are updating the table trade_data
to be called trade
.
In [11]:
|
Python Copy
|
|
During the rename process, the attribute in the db
class is also updated.
In [12]:
|
Python Copy
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Out [12]:
|
1,010,000 rows × 4 columns |
To rename a column in a table, use the rename_column
method. For example, let's rename the sym
column (in the trade
table) to ticker
.
In [13]:
|
Python Copy
|
|
In [14]:
|
Python Copy
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Out [14]:
|
1,010,000 rows × 4 columns |
To safely apply a function to modify the price
column within the database, first create a copy of the column.
In [15]:
|
Python Copy
|
Out [15]:
|
|
In [16]:
|
Python Copy
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Out [16]:
|
1,010,000 rows × 4 columns |
You can now apply a function to the copied column without the risk of losing the original data. Below, let's modify the copied column by multiplying the contents by 2.
In [17]:
|
Python Copy
|
|
In [18]:
|
Python Copy
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Out [18]:
|
1,010,000 rows × 4 columns |
Once you are happy with the new values within the price_copy
column, you can safely delete the price
column, then rename the price_copy
column to be called price
.
In [19]:
|
Python Copy
|
|
In [20]:
|
Python Copy
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Out [20]:
|
1,010,000 rows × 4 columns |
To convert the data type of a column, use the set_column_type
method. Before we do that, let's look at the metadata information for the table using the meta
method:
In [21]:
|
Python Copy
|
||||||||||||||||||||||||
Out [21]:
|
|
Currently the size
column is the type LongAtom
. Let's update this to be a type ShortAtom
:
In [22]:
|
Python Copy
|
Out [22]:
|
|
Now let's apply the grouped
attribute to the size column. For more information on attributes in kdb+, refer to the Q for Mortals Attributes section.
In [23]:
|
Python Copy
|
|
Let's revisit the metadata of the table to ensure they have been applied correctly.
In [4]:
|
Python Copy
|
||||||||||||||||||||||||
Out [4]:
|
|
Onboard your next table
Now that you have successfully set up one table, you may want to add a second table. We follow the same method as before and create the quotes
table using the create
method. In this example, the quotes
table only contains data for 2020.01.03
:
In [25]:
|
Python Copy
|
In [26]:
|
Python Copy
|
|
All tables within a database must contain the same partition structure. To ensure you can access the new table, the quotes
table needs to exist in every partition within the database, even if there is no data for that partition. This is called backfilling data. For the partitions where the quotes
table is missing, we use the fill_database
method:
In [27]:
|
Python Copy
|
|
Now that the database has resolved the missing tables within the partitions, we can view the new quotes
table:
In [28]:
|
Python Copy
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Out [28]:
|
10,000 rows × 6 columns |
Finally, to view the amount of saved data, count the number of rows per partition using partition_count
:
In [29]:
|
Python Copy
|
||||||||||||
Out [29]:
|
|
Clean up temporary database created
In [30]:
|
Python Copy
|