Persisting to object storage

kdb Insights Enterprise can persist data to object storage for inexpensive, durable, long-term storage. Persisting to object storage can significantly reduce storage costs and should be considered for older data sets.

Immutable data

Data in object storage cannot be modified, so it is best suited to data intended for long-term storage. If the data is modified or deleted externally, restart any HDB pods and drop the cache.

Object storage tiers vs mounting an object storage database

An object storage tier is a read-write location and is configured differently than a read-only database. See querying object storage for an example of how to query an existing object storage database.

An object storage tier has to be the final tier in your storage configuration.

Authentication

You will need to provide authentication details to access your private buckets. Object storage tiers can be configured using environment variables for routing details and for authentication. Authentication can also be configured using service accounts.

For more information on service accounts, see automatic registration and environment variables.

| variable | description | default |
| --- | --- | --- |
| AWS_REGION | AWS S3 buckets can be tied to a specific region or set of regions. This value indicates which regional endpoint to use to read and write data from the bucket. | us-east-1 |
| AWS_ACCESS_KEY_ID | AWS account access key ID. | |
| AWS_SECRET_ACCESS_KEY | AWS account secret access key. | |

See this guide on service accounts in EKS to use an IAM role for authorization.

| variable | description | default |
| --- | --- | --- |
| AZURE_STORAGE_ACCOUNT | The storage account name to use for routing Azure blob storage requests. | |
| AZURE_STORAGE_SHARED_KEY | A shared key for the blob storage container to write data to. | |

See this guide on service accounts in AKS to use an IAM role for authorization.

| variable | description | default |
| --- | --- | --- |
| GCLOUD_PROJECT_ID | The Google Cloud project ID to use when interfacing with cloud storage. | |

See this guide on service accounts in GKE on how to use an IAM role for authorization.

Setting environment variables

In a Docker Compose configuration, the environment variables for authentication need to be set on the Storage Manager and Data Access processes.

services:
  dap:
    image: ${kxi_da_single}
    env_file: .env
    environment:
      - AWS_REGION=us-east-2
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
  sm:
    image: ${kxi_sm_single}
    env_file: .env
    environment:
      - AWS_REGION=us-east-2
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
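
Before starting the Storage Manager and Data Access processes, it can help to confirm that the credentials are actually present in the environment. The following is a minimal Python sketch, assuming the variable names from the authentication tables. The names `CREDENTIAL_VARS` and `missing_vars`, and the provider keys `aws`, `azure` and `gcp`, are hypothetical and not part of kdb Insights Enterprise. Deployments that authenticate through service accounts may not need these variables at all.

```python
import os

# Variables from the authentication tables, grouped by provider.
# The grouping and provider keys are assumptions for illustration only.
# AWS_REGION is left out because the table lists a default (us-east-1).
CREDENTIAL_VARS = {
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure": ["AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_SHARED_KEY"],
    "gcp": ["GCLOUD_PROJECT_ID"],
}

def missing_vars(provider, env=None):
    """Return the variables for `provider` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in CREDENTIAL_VARS[provider] if not env.get(name)]
```

Calling `missing_vars("aws")` inside a container returns an empty list when both AWS credentials are set, and otherwise the names that still need a value.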

Performance considerations

Consider the layout of your table before uploading it to object storage. Without attributes, sorted symbols, and date partitioning, any query can become a linear scan that pulls back the whole data set.

Attributes and table layouts are specified in the tables section of the assembly.

Deployment

To persist data to cloud storage, you must deploy an assembly with an HDB tier that has a store key. This setting creates a segmented database in which the HDB loads on-disk and cloud storage data together seamlessly.

Data in object storage is immutable, and cannot be refactored after it has been written. For this reason, date-based partitions bound for object storage can only be written down once per day.

For example, this Storage Manager tier configuration will:

  • Store data in memory for 10 minutes

  • Keep the last 2 days of data on disk

  • Migrate data older than 2 days to object storage

sm:
  tiers:
    - name: streaming
      mount: rdb
    - name: interval
      mount: idb
      schedule:
        freq: 00:10:00
    - name: ondisk
      mount: hdb
      schedule:
        freq: 1D00:00:00
        snap: 01:35:00
      retain:
        time: 2 days
    - name: s3
      mount: hdb
      store: s3://examplebucket/db
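
To make the retention rule concrete, the Python sketch below classifies a written-down date partition as held on disk or in object storage. It only illustrates the rule described above and is not how Storage Manager is implemented. `tier_for_partition` is a hypothetical name, the in-memory tiers are ignored, and the exact boundary (whether a partition exactly 2 days old is still on disk) is an assumption.

```python
from datetime import date

def tier_for_partition(partition, today, retain_days=2):
    """Return the name of the tier expected to hold a date partition.

    Assumes partitions up to `retain_days` days old stay on the `ondisk`
    tier and anything older has been migrated to the `s3` tier.
    """
    if partition > today:
        raise ValueError("partition date is in the future")
    age_in_days = (today - partition).days
    return "ondisk" if age_in_days <= retain_days else "s3"
```

For instance, with `today=date(2024, 1, 10)`, the partition for `date(2024, 1, 7)` is three days old and would be expected in the `s3` tier under these assumptions.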

Mounts

You may see other kdb Insights Enterprise examples referring to object type mounts. Those mounts are for reading from existing cloud storage. For writing your own database to storage, no object mounts are involved, only HDB tier settings to make your HDB a segmented database.

For a guide on reading from an existing object storage database, see querying object storage.

Example using explicit credentials

This example assumes your data is published into a topic called south.

The name of the example bucket is s3://examplebucket, and we want to save our database to a root folder called db. Our example uses the AWS_REGION: us-east-2. Stream processor and table schema sections have been omitted.

First create a new secret with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

kubectl create secret generic aws-access-secret \
  --from-literal=AWS_ACCESS_KEY_ID="${MY_KEY}" \
  --from-literal=AWS_SECRET_ACCESS_KEY="${MY_SECRET}"

Now create your assembly, setting sm.tiers to write to an S3 store, and environment variables for storage.

mounts:
  rdb:
    type: stream
    baseURI: none
    partition: none
    dependency:
      - idb
  idb:
    type: local
    baseURI: file:///data/db/idb
    partition: ordinal
  hdb:
    type: local
    baseURI: file:///data/db/hdb
    partition: date
    dependency:
      - idb
sm:
  source: south
  tiers:
    - name: streaming
      mount: rdb
    - name: interval
      mount: idb
      schedule:
        freq: 00:10:00
    - name: ondisk
      mount: hdb
      schedule:
        freq: 1D00:00:00
        snap: 01:35:00
      retain:
        time: 2 Days
    - name: s3
      mount: hdb
      store: s3://examplebucket/db
dap:
  instances:
    idb:
      mountName: idb
    hdb:
      mountName: hdb
    rdb:
      tableLoad: empty
      mountName: rdb
      source: south

The name of the example storage container is ms://mycontainer, and we want to save our database to a folder within the container called db. Our example uses the AZURE_STORAGE_ACCOUNT: iamanexample. This means we will write our database to a folder called 'db', inside the container 'mycontainer' for the storage account iamanexample. Stream processor and table schema sections have been omitted.
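
A store URI combines a scheme, the bucket or container name, and the database root folder. The short Python sketch below splits such a URI with the standard library's `urllib.parse`. `split_store_uri` is a hypothetical helper used only to illustrate how the parts relate; it is not part of kdb Insights Enterprise.

```python
from urllib.parse import urlparse

def split_store_uri(uri):
    """Split a store URI into (scheme, bucket_or_container, folder)."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"not a valid store URI: {uri!r}")
    # The path keeps its leading slash, so strip it to get the folder name.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")

print(split_store_uri("s3://examplebucket/db"))  # ('s3', 'examplebucket', 'db')
print(split_store_uri("ms://mycontainer/db"))    # ('ms', 'mycontainer', 'db')
```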

First create a new secret with your AZURE_STORAGE_SHARED_KEY.

kubectl create secret generic azure-storage-secret --from-literal=AZURE_STORAGE_SHARED_KEY=${MY_KEY}

Now create your assembly, setting sm.tiers to write to an Azure store container, and environment variables for storage.

mounts:
  rdb:
    type: stream
    baseURI: none
    partition: none
    dependency:
      - idb
  idb:
    type: local
    baseURI: file:///data/db/idb
    partition: ordinal
  hdb:
    type: local
    baseURI: file:///data/db/hdb
    partition: date
    dependency:
      - idb
sm:
  source: south
  tiers:
    - name: streaming
      mount: rdb
    - name: interval
      mount: idb
      schedule:
        freq: 00:10:00
    - name: ondisk
      mount: hdb
      schedule:
        freq: 1D00:00:00
        snap: 01:35:00
      retain:
        time: 2 Days
    - name: s3
      mount: hdb
      store: ms://mycontainer/db
dap:
  instances:
    idb:
      mountName: idb
    hdb:
      mountName: hdb
    rdb:
      tableLoad: empty
      mountName: rdb
      source: south