SingleStore Managed Service

CREATE PIPELINE ... WITH TRANSFORM

Creates a pipeline that uses a transform. A transform is an optional user-provided program, such as a Python script. A transform receives data from a pipeline’s extractor, shapes the data, and returns the results to the pipeline. The pipeline then loads the result data into the database.

A transform is one of three methods you can use to shape data ingested from a pipeline.

Notice

SingleStore Managed Service does not support transforms.

Syntax

CREATE PIPELINE ... WITH TRANSFORM ('uri', 'program', 'arguments [...]')

Each of the transform’s parameters is described below:

  • uri: The location from which the user-provided program can be downloaded, specified as either an http:// or file:// endpoint. If the URI points to a tarball with a .tar.gz or .tgz extension, its contents are extracted automatically, and the program parameter must also be specified. Alternatively, if the URI points to the user-provided program file itself (such as file://localhost/root/path/to/my-transform.py), the program and arguments parameters can be empty.

  • program: The filename of the user-provided program to run. This parameter is required if a tarball was specified as the endpoint for the transform’s URI. If the URI specifies the user-provided program file itself, this parameter can be empty.

  • arguments: A series of arguments that are passed to the transform at runtime. Arguments must be separated by spaces.
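
To illustrate how a transform might consume the arguments parameter, here is a minimal Python sketch. It assumes the space-separated arguments reach the program as ordinary command-line arguments; the source does not spell out the mechanism, so treat that as an assumption. The flag names --delimiter and --upper and the function parse_transform_args are hypothetical and chosen only for this example.

```python
#!/usr/bin/env python3
"""Sketch: parsing transform arguments (assumes they arrive on the command line)."""
import argparse


def parse_transform_args(argv):
    """Parse a list of argument strings into an options object.

    The flags below are hypothetical; a real transform defines its own.
    """
    parser = argparse.ArgumentParser(description="Example transform options")
    parser.add_argument("--delimiter", default=",",
                        help="field separator used in each input line")
    parser.add_argument("--upper", action="store_true",
                        help="upper-case every field")
    return parser.parse_args(argv)


# Simulate an arguments string such as '--delimiter | --upper'
# being split on spaces before it is handed to the program.
options = parse_transform_args("--delimiter | --upper".split())
print(options.delimiter, options.upper)
```

Because the arguments are separated by spaces, an individual argument value in this sketch cannot itself contain a space.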

Note

For information on creating a pipeline without the WITH TRANSFORM clause, see CREATE PIPELINE.

WITH TRANSFORM('http://memsql.com/my-transform.py','','')
WITH TRANSFORM('file://localhost/root/path/to/my-transform.py','','')
WITH TRANSFORM('http://memsql.com/my-transform-tarball.tar.gz', 'my-transform.py','')
WITH TRANSFORM('http://memsql.com/my-transform-tarball.tar.gz', 'my-transform.py', '-arg1 -arg2')

Remarks

  • During pipeline creation, a cluster’s master aggregator distributes the transform to each leaf node in the cluster. Each leaf node then executes the transform every time a batch partition is processed.

  • When the CREATE PIPELINE statement is executed, the transform must be accessible at the specified file system or network endpoint. If the transform is unavailable, pipeline creation will fail.

  • Depending on the language used to write the transform and the platform used to deploy it, virtual machine overhead may greatly reduce a pipeline’s performance. Transforms are executed every time a batch partition is processed, which can be many times per second, so any overhead that slows the transform’s execution degrades the performance of the entire pipeline.

  • You must install any required dependencies for your transform (such as Python) on each leaf node in your cluster. Before running START PIPELINE, run TEST PIPELINE to verify that your nodes are set up properly.

  • Transforms can be written in any language, but the SingleStore DB node’s host Linux distribution must have the required dependencies to execute the transform. For example, if you write a transform in Python, the node’s Linux distribution must have Python installed and configured before it can be executed.

  • At the top of your transform file, use a shebang to specify the interpreter to use to execute the script (e.g. #!/usr/bin/env python3 for Python 3 or #!/usr/bin/env ruby for Ruby).

  • Use Unix line endings in your transform file.

  • A transform reads from stdin to receive data from a pipeline’s extractor. After shaping the input data, the transform writes to stdout, which returns the results to the pipeline.

  • Transactional guarantees apply only to data written to stdout. There are no transactional guarantees for any side effects that are coded in the transform logic.
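
The stdin/stdout contract and the shebang remark above can be sketched as a small Python transform. This is an illustrative sketch, not a reference implementation: it assumes the extractor delivers newline-delimited, comma-separated text, which may not hold for every data source, and the shaping step (trimming and upper-casing each field) is hypothetical. The names shape_line and run are introduced only for this example.

```python
#!/usr/bin/env python3
"""Minimal transform sketch: read records from stdin, write shaped records to stdout."""
import sys


def shape_line(line):
    """Hypothetical shaping step: trim and upper-case each comma-separated field."""
    fields = [field.strip().upper() for field in line.rstrip("\n").split(",")]
    return ",".join(fields) + "\n"


def run(source, sink):
    """Copy every non-blank line from source to sink after shaping it."""
    for line in source:
        if line.strip():  # skip blank lines
            sink.write(shape_line(line))
    sink.flush()


if __name__ == "__main__":
    # Only data written to stdout is returned to the pipeline, so this
    # sketch avoids side effects such as writing to other files.
    run(sys.stdin, sys.stdout)
```

A script like this can be tried locally by piping a sample file into it before it is referenced in a CREATE PIPELINE statement. Save it with Unix line endings so that the shebang line is interpreted correctly.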