SingleStore Managed Service

Load Data from AWS Glue

AWS Glue is a fully managed serverless data integration service that allows users to extract, transform, and load (ETL) from various data sources for analytics and data processing. AWS Glue runtime supports connectivity to a variety of data sources.

SingleStore provides a SingleStore connector for AWS Glue based on Apache Spark Datasource, available through AWS Marketplace. VPC networking and integration with AWS Secrets Manager for authentication credentials are supported for the connection.

This document shows how you can use SingleStore’s AWS Glue Marketplace Connector in AWS Glue Studio to create ETL jobs connecting to your SingleStore cluster using an easy-to-use graphical interface.

The following architecture diagram shows SingleStore connecting with AWS Glue for an ETL job.

16083b576a82e3.png

This integration supports SingleStore Managed Service and SingleStore DB version 7.1 and later.

Configuration Overview

This document walks you through connecting a SingleStore cluster in an AWS Glue ETL job as the source, transforming the data, and storing it back on a SingleStore cluster and in Apache Parquet format on Amazon S3. Here, we use the TPC-H Benchmark dataset that is available as a sample dataset in SingleStore.

To successfully create the ETL job using the SingleStore connector:

  1. Store authentication credentials in Secrets Manager.

  2. Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job.

  3. Configure the SingleStore connector and connection.

  4. Create an ETL job using the SingleStore connection in AWS Glue Studio.

Storing Authentication Credentials in Secrets Manager

AWS Glue provides integration with AWS Secrets Manager to securely store connection authentication credentials. Follow these steps to create these credentials:

  1. On the Secrets Manager console, choose Store a new secret.

  2. For Select a secret type, select Other type of secrets.

  3. For Secret key/value, set one row for each of the following parameters:

    • ddlEndpoint

    • database

    • user

    • password

  4. Choose Next.

    16083b576b0665.png
  5. For Secret name, enter aws-glue-singlestore-connection-info.

  6. Choose Next.

  7. Keep the Disable automatic rotation check box selected.

  8. Choose Next.

  9. Choose Store.

Creating an IAM Role for the AWS Glue ETL Job

In this section, you create a role with an attached policy to allow read-only access to credentials that are stored in Secrets Manager for the AWS Glue ETL job.

  1. On the IAM console, choose Policies.

  2. Choose Create policy.

  3. On the JSON tab, enter the following JSON snippet, providing your Region and account ID:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret"
                ],
                "Resource": "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:aws-glue-*"
            }
        ]
    }
    
  4. Choose Review Policy.

  5. Give your policy a name, for example, GlueAccessSecretValue.

  6. In the navigation pane, choose Roles.

  7. Choose Create role.

  8. For Select type of trusted entity, choose AWS service.

  9. Choose Glue.

    16083b576ba774.png
  10. Choose Next: Permissions.

  11. Search for the AWS managed policies AWSGlueServiceRole and AmazonEC2ContainerRegistryReadOnly, and select them.

  12. Search for the GlueAccessSecretValue policy created before, and select it.

  13. For Role name, enter a name, for example, GlueCustomETLConnectionRole.

  14. Confirm the three policies are selected.

    16083b576c278d.png
Configuring your SingleStore Connector and Connection

To subscribe to the SingleStore connector and configure the connection, complete the following steps:

  1. On the AWS Glue console, choose AWS Glue Studio.

  2. Choose Connectors.

  3. Choose Go to AWS Marketplace.

  4. Subscribe to the SingleStore connector for AWS Glue from AWS Marketplace.

  5. Open the following link in your web browser to Activate the connector from AWS Glue Studio:

    https://console.aws.amazon.com/gluestudio/home?#/connector/add-connection?connectorName="SingleStore connector for AWS Glue"&connectorType="Spark"&connectorDescription="The SingleStore connector for AWS Glue allows you to connect to data in AWS Glue"&connectorUrl="https%3A%2F%2F117940112483.dkr.ecr.us-east-1.amazonaws.com%2F9b446989-6efd-4ed3-a213-c4bf90cba48e%2Fcg-4240541378%2Fmemsql-spark%3A3.0.4-latest"&connectorVersion="3.0.4"&connectorClassName="com.memsql.spark"
    
  6. In the navigation pane, under Connectors, choose Create connection.

  7. Enter a name for the connection, such as SingleStore_connection.

  8. For AWS Secret, choose the AWS secret value aws-glue-singlestore-connection-info created before.

    16083b576cbb64.png
  9. Choose Create connection.

Creating an ETL job using the SingleStore connection in AWS Glue Studio

Now that you have set up authentication and configured the SingleStore connector, you can create an ETL job using that connection.

  1. On the AWS Glue Studio console, choose Connectors.

  2. Select your SingleStore connector and choose Create job.

    16083b576d3a1a.png

    An untitled job is created with the connection as the source node.

  3. On the Job details page, name the job. For our example: SingleStore_tpch_transform_job.

  4. For Description, enter Glue job to transform tpch data from SingleStore DB.

  5. For IAM Role, choose GlueCustomETLConnectionRole.

  6. Keep the other properties at their default.

  7. On the Visual page, on the Data source properties – Connector tab, expand Connection options.

  8. For Key, enter dbtable.

  9. For Value, enter lineitem.

    16083b576dc4c4.png

    Because AWS Glue Studio is using information stored in the connection to access the data source instead of retrieving metadata information from a Data Catalog table, you must provide the schema metadata for the data source. Use the schema editor to update the source schema. For instructions on how to use the schema editor, see Editing the schema in a custom transform node.

  10. Choose the + icon.

  11. For Node type, choose DropFields.

    16083b576e4ac9.png
  12. On the Transform tab, select the fields to drop as shown.

  13. Choose the + icon.

  14. For Node type, choose Custom Transform.

  15. On the Transform tab, add to the custom script.

    In this example, we calculate two additional columns, disc_price and price. Then we use glueContext.write_dynamic_frame to write the updated data back on SingleStore using the connection SingleStore_connection we created. See the following code:

    def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
        from pyspark.sql.functions import col
    
        df = dfc.select(list(dfc.keys())[0]).toDF()
        df1 = df.withColumn("disc_price",(col("l_extendedprice")*(1-col("l_discount"))).cast("decimal(10,2)"))
        df2 = df1.withColumn("price", (col("disc_price")*(1+col("l_tax"))).cast("decimal(10,2)"))
        dyf = DynamicFrame.fromDF(df2, glueContext, "updated_lineitem")
    
        glueContext.write_dynamic_frame.from_options(frame = dyf,
         connection_type = "marketplace.spark",
         connection_options = {"dbtable":"updated_lineitem","connectionName":"SingleStore_connection"})
    
        return(DynamicFrameCollection({"CustomTransform0": dyf}, glueContext))
    
    16083b576ebc1e.png
  16. On the Output schema tab, add the additional columns price and disc_price created in the custom script.

    16083b576f3857.png
  17. Keep the default for node SelectFromCollection.

  18. Choose the + icon.

  19. For Node type, choose Data Target - S3.

  20. On the Data target properties – S3 tab, for Format, choose Parquet.

  21. For S3 Target Location, enter s3://aws-glue-assets-{Your Account ID as a 12-digit number}-{AWS region}/output/.

    16083b57707177.jpg
  22. Choose Save.

  23. Choose Run.

  24. A new table updated_lineitem is created with the two additional columns disc_price and price.

    16083b5770ed64.png