Azure Data Factory

kdb Insights Enterprise can integrate with Azure Data Factory (ADF). You can use ADF to ingest any data into kdb Insights Enterprise, particularly if you have data in a format for which there is no native reader in kdb Insights Enterprise today. Your ADF pipeline can transform the data into a supported format and then trigger the ingestion into kdb Insights Enterprise.

Introduction

This example shows you how to use ADF to watch a specific container within an Azure Storage Account and trigger an ingestion when .csv files containing data to be ingested are uploaded.

When the ADF pipeline is triggered, the following sequence of events occurs:

  1. The ADF pipeline authenticates against a target kdb Insights Enterprise.

  2. With the destination database running, the kdb Insights Enterprise pipeline defined as part of the factory definition is triggered. This pipeline does the following:

    1. Reads the recently uploaded .csv file from blob storage.

    2. Attempts to ingest the data into a specified table within the targeted database using automated packaging.

  3. The kdb Insights Enterprise package status is monitored during execution.

  4. Upon completion of the kdb Insights Enterprise package execution, the package is torn down and the ingested data can be queried using any of the available query methods, including the UI and REST.
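
For example, once the package has completed and been torn down, the ingested data can be retrieved over REST. The sketch below is illustrative only: it assumes a bearer token is already held in $token, and the getData endpoint path is an assumption based on a typical kdb Insights Enterprise deployment, so check the REST query documentation for your version.

    bash

    # Illustrative sketch: query the ingested data over REST after the package completes.
    # $baseUrl, $tableName and $token are placeholders; the endpoint path is assumed.
    curl -X POST "$baseUrl/servicegateway/kxi/getData" \
    --header "Content-Type: application/json" \
    --header "Accept: application/json" \
    --header "Authorization: Bearer $token" \
    --data "{\"table\":\"$tableName\",\"startTS\":\"2024-01-01T00:00:00.000000000\",\"endTS\":\"2024-01-02T00:00:00.000000000\"}"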

Warning

The schema of the .csv must match the schema of the table being written to. Type conversion is not performed by the pipeline.

Limitations

  • ADF requires the destination database to be running.

  • ADF requires the destination database to contain the target table, with a schema that matches the .csv.

  • The kdb Insights Enterprise pipeline uses HDB Direct Write. Historical data is written directly to disk, but data cannot be queried until the pipeline has finished running and postprocessing is completed. Data does not go through RDB or IDB.

Prerequisites

Ensure the following prerequisites are met before deploying the example:

  • The az CLI must be configured locally.

  • An Azure storage account and container from which to read .csv files must exist.

  • A resource group to deploy ADF into must exist. You can use the kdb Insights Enterprise resource group for this, or create a new group.

    Note

    Reducing latency when using a separate resource group

    If a separate resource group is required, consider creating it in the same location as the kdb Insights Enterprise deployment to reduce latency.

    The following command can be used to create a new group:

    bash

    az group create -l $location -g $adfResourceGroupName
  • The following files must be downloaded and accessible to the az client:

    • adf.bicep

    • main.parameters.json

  • Configure the parameters defined in main.parameters.json, as described in the Parameters section below.

  • kdb Insights Enterprise must be deployed and running.

  • The target database must be either in a ready state or stopped. If it is not already running, the factory starts it.

  • The schema, table and database being written to must exist.

    Warning

    The schema of the .csv must match the schema of the table being written to for the ingestion to be successful. Type conversion is not performed by the pipeline.

  • A user or service account must exist. This client is used by the factory to authenticate against kdb Insights Enterprise and must have at least the following application roles in Keycloak (a sketch for verifying the credentials follows this list):

    • insights.builder.assembly.get

    • insights.builder.assembly.list

    • insights.builder.schema.get

    • insights.builder.schema.list

    • insights.pipeline.get

    • insights.pipeline.create

    • insights.pipeline.delete

    • insights.pipeline.status

    • insights.builder.assembly.deploy

    • insights.query.data
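
    To confirm the client is configured correctly before deploying the factory, you can request a token directly from Keycloak using the client credentials grant. This is a minimal sketch: the /auth prefix and the realm name insights are assumptions based on a default kdb Insights Enterprise installation, and $baseUrl, $clientId and $clientSecret correspond to the parameters described below.

    bash

    # Minimal sketch: verify the service account by requesting a token from Keycloak.
    # The /auth prefix and the realm name "insights" are assumed; adjust for your deployment.
    curl -X POST "$baseUrl/auth/realms/insights/protocol/openid-connect/token" \
    --header "Content-Type: application/x-www-form-urlencoded" \
    --data-urlencode "grant_type=client_credentials" \
    --data-urlencode "client_id=$clientId" \
    --data-urlencode "client_secret=$clientSecret"

    A successful response contains an access_token field, confirming that the client ID and secret are valid.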

Parameters

The downloaded version of main.parameters.json needs to be updated with the following parameters. They are divided into two main categories:

  • Variables required to correctly interact with kdb Insights Enterprise:

    • baseUrl: Base kdb Insights Enterprise URL. Ensure this does not contain a trailing slash.

    • clientId: Keycloak client ID as configured during prerequisites

    • clientSecret: Keycloak client secret as configured during prerequisites

    • tableName: Table to write to

    • packageName: Name of the package ADF will create and teardown during the workflow

    • destinationPackageName: Target package name containing database, table and schema to write to

  • Variables required to correctly interact with Azure:

    • containerName: Name of the container to be monitored for .csv file uploads

    • storageAccountKey: Shared key used to access the storage account (one way to retrieve this is shown after this list)

    • triggerScope: The resource ID of the storage account to be monitored for uploads

      This can be retrieved using:

      bash

      az storage account show \
      -g $storageAccountResourceGroup \
      -n $storageAccountName --query id -o tsv
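
The value for storageAccountKey can be retrieved in a similar way. The sketch below assumes the storage account's resource group and name are held in $storageAccountResourceGroup and $storageAccountName, as in the previous command:

    bash

    # One way to retrieve the shared key for the storageAccountKey parameter.
    az storage account keys list \
    -g $storageAccountResourceGroup \
    -n $storageAccountName --query "[0].value" -o tsv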

Instructions

  1. Ensure the az CLI is configured to operate on the desired subscription:

    bash

    az account set --subscription $adfSubscriptionId
  2. Deploy ADF. Run the following command, supplying the parameters from main.parameters.json:

    bash

    az deployment group create \
    --resource-group $adfResourceGroupName \
    --template-file adf.bicep \
    --parameters main.parameters.json

    Warning

    Keep a note of the resource group name the factory is being deployed to, as it is required in the next step.

    Tip

    Alternatively, use the -p flag to set individual values, or reference the parameters file with the @main.parameters.json syntax; the two approaches can be combined, as shown below.
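
    For example, this sketch supplies most values from the parameters file and overrides the client secret on the command line; the shell variable $clientSecret is illustrative:

    bash

    # Illustrative sketch: combine the parameters file with a single command-line override.
    az deployment group create \
    --resource-group $adfResourceGroupName \
    --template-file adf.bicep \
    --parameters @main.parameters.json \
    -p clientSecret=$clientSecret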

    Upon completion of this step, ADF is accessible from the resource group. Launching the studio allows the pipelines to be inspected in the Author view:

    ADF pipelines

  3. Activate the trigger. This action is required for ADF to start listening for upload events on the storage account. The resource group must be the one used in the previous step.

    bash

    az datafactory trigger start \
    --factory-name $adfFactoryName \
    --resource-group $adfResourceGroupName \
    --name 'csv_blob_created_trigger'
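
    To confirm the trigger is now listening, you can inspect it with the az CLI; this is a sketch, and the trigger's runtime state should report Started once activation completes:

    bash

    # Sketch: inspect the trigger to confirm it has started listening for blob-created events.
    az datafactory trigger show \
    --factory-name $adfFactoryName \
    --resource-group $adfResourceGroupName \
    --name 'csv_blob_created_trigger'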

If these steps complete without error, the factory is ready to ingest uploaded .csv files.
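
For a quick end-to-end test, you can push a sample file to the monitored container with the az CLI. This is a sketch: the file name sample.csv is a placeholder, and the file's columns must match the destination table's schema.

    bash

    # Sketch: upload a test .csv to the monitored container to fire the trigger.
    az storage blob upload \
    --account-name $storageAccountName \
    --account-key $storageAccountKey \
    --container-name $containerName \
    --name sample.csv \
    --file ./sample.csv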

Monitoring

The ADF pipeline runs can be monitored using the Monitor section in ADF:

ADF pipeline runs

The Monitor section of ADF shows the following upon success:

ADF successful pipeline runs
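
The same runs can also be listed from the command line with the az datafactory extension; this is a sketch, and the time window values are placeholders:

    bash

    # Sketch: list recent pipeline runs for the factory within a time window.
    az datafactory pipeline-run query-by-factory \
    --factory-name $adfFactoryName \
    --resource-group $adfResourceGroupName \
    --last-updated-after 2024-01-01T00:00:00Z \
    --last-updated-before 2024-01-02T00:00:00Z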

Note

The kdb Insights Enterprise pipeline is torn down upon successful completion. If this does not occur, the pipeline remains running in kdb Insights Enterprise and logs should be inspected to determine the cause.