Batch S3 ingestion and preprocessing

This page describes how to read bulk data from S3 and apply scaling and preprocessing steps prior to persistence.

Motivation

This example provides a workflow showing preprocessing steps commonly applied to data in machine-learning use cases prior to persistence. Such a workflow matters when the bulk data is to be used repeatedly by a data-science team that relies on clean, scaled data.

Example

This example closely follows the S3 ingestion examples outlined here, applying a number of preprocessing steps to the data using the Machine Learning functionality defined here. In this case, the data is preprocessed prior to persistence in order to save feature sets for centralised machine-learning tasks.

If you don't already have the sample package, follow the steps here to download it.

Unpack the package to view its contents.

bash

export PKG=s3-ingest
kxi package unpack $PKG-$KX_VERSION.kxi

The default pipeline spec is written in q and is located under s3-ingest/src/ml.q in the unpacked package. The Python pipeline spec is located under s3-ingest/src/ml.py.
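
For orientation, the following is a minimal sketch of how such a q spec tends to be structured. The .qsp operator names are from the Stream Processor q API, but everything else here (the source path, schema, cleanBatch function, and target names) is an illustrative assumption rather than the contents of the shipped spec.

q

/ Hypothetical outline of a q Stream Processor spec in the style of src/ml.q.
/ All arguments below (source path, schema, function and target names) are
/ illustrative assumptions.
.qsp.run
  .qsp.read.fromAmazonS3["s3://example-bucket/trips.csv"] / assumed source object
  .qsp.decode.csv[schema]                                 / schema assumed to be defined in the spec
  .qsp.map[cleanBatch]                                    / cleaning/scaling; column-level logic sketched below
  .qsp.write.toDatabase[`trips; `assembly]                / assumed table and target names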

These update the existing pipeline specs to:

  • remove any null values

  • remove any infinities

  • normalise values by scaling them into the 0-1 range
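
As a concrete illustration, the column-level logic that a cleaning function such as cleanBatch might apply could look like the plain-q sketch below; the function names are assumptions for illustration, not code from the shipped spec.

q

/ Minimal plain-q sketch of the three steps above; cleanCol and scale01
/ are assumed names for illustration, not code from the shipped spec.
scale01:{(x - min x) % max[x] - min x}        / normalise into the 0-1 range

cleanCol:{[v]
  v:v where not null v;                       / remove any null values
  v where not v in -0w 0w }                   / remove any infinities

scale01 cleanCol 1 0N 3 0w 5f                 / -> 0 0.5 1f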

Deploying

Authenticate with kdb Insights Enterprise and deploy the sample package.

Note

Cleaning up resources

Make sure to tear down any previously installed version of the package and clean up resources before deploying a new version.

Once deployed, the pipeline starts to ingest data and write to the database.

The pipeline can be switched to use either the q or the Python ML spec; follow one of the two variants below.

q

Switch the pipeline to use the q ML spec.

Replace the values of base and src in s3-ingest/pipelines/transport.yaml.

YAML

base: q-ml
..
src: src/ml.q

Then deploy the package using the commands below.

bash

Copy
kxi auth login
kxi pm push $PKG
kxi pm deploy $PKG

Python

Switch the pipeline to use the Python ML spec.

Replace the values of base and src in s3-ingest/pipelines/transport.yaml.

YAML

base: py-ml
..
src: src/ml.py

Then deploy the package using the commands below.

bash

kxi auth login
kxi pm push $PKG
kxi pm deploy $PKG

Checking progress

To check the progress of the ingestion and validate success, follow the steps described here.

Teardown

Tear down the package and data using the command below.

bash

kxi pm teardown --rm-data $PKG

Summary

The above workflow shows the ingestion and modification of a file stored in S3 prior to persistence, simulating the updating and cleaning of centralised data for use by a team of data scientists.

To validate that the pipeline has been applied correctly, query the columns 'ehail_fee', 'total', 'extra', and 'fare': each should contain only values between 0 and 1, with no infinite values or nulls.
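
As a quick local sanity check, assuming the ingested rows have been pulled back into a q session as a table named trips (the table name and sample values are assumptions), something like the following verifies all three properties for each column.

q

/ Hypothetical validation in a q session; trips and its values are assumed
/ stand-ins for the ingested data.
trips:([] ehail_fee:0 0.5 1f; total:0.2 0.4 0.9; extra:0 1 0.5f; fare:0.1 0.3 0.7);

checkCol:{[t;c]
  v:t c;
  (not any null v) and (not any v in -0w 0w) and all v within 0 1f }

checkCol[trips] each `ehail_fee`total`extra`fare   / expect 1111b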

Further Reading