Load Data from AWS Glue

AWS Glue is a fully managed, serverless data integration service that lets you extract, transform, and load (ETL) data from a variety of sources for analytics and data processing. The AWS Glue runtime supports connectivity to a wide range of data sources.

SingleStore provides a connector for AWS Glue based on the Apache Spark Data Source API, available through AWS Marketplace. The connector supports VPC networking and integrates with AWS Secrets Manager for authentication credentials.

This topic shows how you can use SingleStore’s AWS Glue Marketplace Connector in AWS Glue Studio to create ETL jobs connecting to your SingleStore cluster using an easy-to-use graphical interface.

The following architecture diagram shows SingleStore connecting with AWS Glue for an ETL job.

16083b576a82e3.png

The SingleStore connector for AWS Glue is compatible with SingleStore DB versions 7.1+.

Prerequisites

  • You must have admin access to the AWS account.

  • The SingleStore AWS Glue connector returns an error in the following AWS Regions: Hong Kong, São Paulo, Stockholm, Bahrain, and Cape Town.

  • Create a SingleStore cluster and load the TPC-H dataset.

Configuration Overview

The following steps use a SingleStore cluster as the source in an AWS Glue ETL job, transform the data, and store it back in the SingleStore cluster and in Apache Parquet format on Amazon S3. This example uses the TPC-H benchmark dataset (available as a sample dataset in SingleStore).

To successfully create the ETL job using the SingleStore connector:

  1. Store authentication credentials in Secrets Manager.

  2. Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job.

  3. Configure the SingleStore connector and connection.

  4. Create an ETL job using the SingleStore connection in AWS Glue Studio.

Store Authentication Credentials in Secrets Manager

AWS Glue provides integration with AWS Secrets Manager to securely store connection authentication credentials. To create these credentials:

  1. Log in to AWS, and select AWS Management Console > AWS services > Secrets Manager.

  2. In the Secrets dialog, select Store a new secret.

  3. Under Secret type, select Other type of secrets.

  4. In the Store a new secret dialog, under Key/value pairs, set one row for each of the following parameters:

    • ddlEndpoint

    • database

    • user

    • password

    store_new_secret_glue.png

    Select Next.

  5. In the Secret name box, enter aws-glue-singlestore-connection-info. Select Next.

  6. Select Disable automatic rotation, and then select Next.

  7. Select Store. Open the secret from the list. Copy the Secret ARN, and store it at a secure location.
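The secret can also be created programmatically. The following sketch builds the JSON payload with the four keys the connector reads; the endpoint and credentials are placeholder values, and the boto3 call is shown only as a comment because it requires AWS credentials:

```python
import json

# Placeholder values -- replace with your cluster's DDL endpoint and credentials.
secret_value = {
    "ddlEndpoint": "svc-example.aws.singlestore.com:3306",
    "database": "tpch",
    "user": "admin",
    "password": "example-password",
}
secret_string = json.dumps(secret_value)

# With boto3, the same secret could be stored without the console:
#   import boto3
#   client = boto3.client("secretsmanager")
#   resp = client.create_secret(
#       Name="aws-glue-singlestore-connection-info",
#       SecretString=secret_string)
#   print(resp["ARN"])  # the Secret ARN to record

print(secret_string)
```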

Create an IAM Role for the AWS Glue ETL Job

To create a role with an attached policy to allow read-only access to credentials that are stored in Secrets Manager for the AWS Glue ETL job:

  1. Log in to AWS, and select AWS Management Console > AWS services > IAM.

  2. In the IAM dialog, select Policies > Create Policy.

  3. On the JSON tab, enter the following JSON snippet (update the Region and account ID from the Secret ARN):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret"
                ],
                "Resource": "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:aws-glue-*"
            }
        ]
    }
    
  4. Select Next: Tags, and then select Next: Review.

  5. In the Name box, enter a name for the policy, for example, GlueAccessSecretValue. Select Create policy.

  6. In the navigation pane, select Roles > Create role.

  7. In the Create Role dialog, under Select type of trusted entity, select AWS Service. Under Choose a use case, select Glue.

  8. Select Next: Permissions.

  9. Find and select the following AWS managed policies: AWSGlueServiceRole and AmazonEC2ContainerRegistryReadOnly.

  10. Find the policy created earlier, GlueAccessSecretValue, and select it.

  11. In the Role name box, enter a role name (for example, GlueCustomETLConnectionRole).

  12. Confirm that the three policies are selected, and select Create role.

    confirm_policy_selection_glue.png
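For reference, choosing the Glue use case in step 7 attaches a trust relationship equivalent to the following, which lets the AWS Glue service assume the role:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "glue.amazonaws.com" },
            "Action": "sts:AssumeRole"
        }
    ]
}
```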

Configure the SingleStore Connector and Connection

To subscribe to the SingleStore connector and configure the connection:

  1. On the dashboard, select AWS Glue > AWS Glue Studio.

  2. In the Getting Started dialog, select View connectors.

    view_connectors_glue.png
  3. Select Go to AWS Marketplace. On AWS Marketplace, subscribe to the SingleStore connector for AWS Glue.

  4. Open the following link in a web browser to activate the connector from AWS Glue Studio:

    https://console.aws.amazon.com/gluestudio/home?#/connector/add-connection?connectorName=%22SingleStore%20connector%20for%20AWS%20Glue%22&connectorType=%22Spark%22&connectorDescription=%22The%20SingleStore%20connector%20for%20AWS%20Glue%20allows%20you%20to%20connect%20to%20data%20in%20AWS%20Glue%22&connectorUrl=%22https://709825985650.dkr.ecr.us-east-1.amazonaws.com/singlestore/marketplace-spark:4.0.0%22&connectorVersion=%224.0.0%22&connectorClassName=%22com.singlestore.spark%22
  5. In AWS Glue Studio, select Connectors > Create connection.

  6. In the Create connection dialog, under Connection properties, enter a name for the connection in the Name box, for example, SingleStore_connection.

  7. From the AWS Secret list, select the AWS secret value aws-glue-singlestore-connection-info created earlier.

  8. Select Create connection and activate connector.

Create an ETL Job using the SingleStore Connection in AWS Glue Studio

After you have set up authentication and configured the SingleStore connector, to create an ETL job using the connection:

  1. In AWS Glue Studio, select Connectors.

  2. In the Connections dialog, select SingleStore_connection (the connection created earlier). Select Create job.

  3. On the Job details tab, in the Name box, enter a name for the job, for example, SingleStore_tpch_transform_job.

  4. In the Description box, enter a description, for example, Glue job to transform tpch data from SingleStore DB.

  5. From the IAM Role list, select GlueCustomETLConnectionRole.

  6. Use default settings for other properties. Select Save.

  7. On the Visual tab, in the workspace area, select the SingleStore Data source - Connection. On the Data source properties – Connector tab, expand Connection options.

    glue_select_data_source.png
  8. On the Data source properties - Connector tab, expand Connector options, and select Add new option.

  9. Under Connection options, in the Key box, enter dbtable, and in the Value box, enter lineitem. (This example uses the lineitem table from the tpch database.)
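    For reference, these GUI settings map onto the connector's connection options. In a script-based Glue job, the same source could be read as sketched below; the connection and table names are the examples from this walkthrough, and the `glueContext` call is shown only as a comment because it requires a Glue runtime:

    ```python
    # Connection options equivalent to the GUI settings above.
    connection_options = {
        "connectionName": "SingleStore_connection",  # the Glue connection created earlier
        "dbtable": "lineitem",                       # source table in the tpch database
    }

    # In a script-based Glue job, the source would be read with:
    #   dyf = glueContext.create_dynamic_frame.from_options(
    #       connection_type="marketplace.spark",
    #       connection_options=connection_options)

    print(sorted(connection_options))
    ```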

  10. On the Output schema tab, select Edit. Select the three dots, and select Add root key from the list.

    Note

    In this example, AWS Glue Studio is using information stored in the connection to access the data source instead of retrieving metadata information from a Data Catalog table. Hence, you must provide the schema metadata for the data source. Use the schema editor to update the source schema. For instructions on how to use the schema editor, see Editing the schema in a custom transform node.

  11. Add the key and datatype, which represents the name of the column in the database and its datatype, respectively. Select Apply.

  12. On the Visual tab workspace area, from the Transform list, select DropFields.

  13. On the Node properties tab (for Drop Fields), from the Node parents list, select SingleStore connector for AWS Glue.

    glue_drop_fields_node_parents.png
  14. On the Transform tab, select the fields to drop.

  15. On the Visual tab workspace area, from the Transform list, select Custom Transform.

  16. On the Node properties tab (for Custom transform), from the Node parents list, select Drop Fields.

    glue_custom_transform_drop.png
  17. On the Transform tab, add the following script to the Code block box:

    from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
    from pyspark.sql.functions import col

    def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
        # Take the first DynamicFrame in the collection and limit it to 100 rows
        df = dfc.select(list(dfc.keys())[0]).toDF().limit(100)

        # Derive the discounted price and the final price including tax
        df1 = df.withColumn("disc_price", (col("l_extendedprice") * (1 - col("l_discount"))).cast("decimal(10,2)"))
        df2 = df1.withColumn("price", (col("disc_price") * (1 + col("l_tax"))).cast("decimal(10,2)"))
        dyf = DynamicFrame.fromDF(df2, glueContext, "updated_lineitem")

        # Write the transformed data back to SingleStore through the connector
        glueContext.write_dynamic_frame.from_options(
            frame=dyf,
            connection_type="marketplace.spark",
            connection_options={"dbtable": "updated_lineitem", "connectionName": "SingleStore_connection"})

        return DynamicFrameCollection({"CustomTransform0": dyf}, glueContext)


    In this example, we calculate two additional columns, disc_price and price, and then use glueContext.write_dynamic_frame to write the updated data back to SingleStore using the SingleStore_connection connection created earlier.
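    The two formulas can be checked with plain decimal arithmetic. The sketch below mirrors the derived columns (disc_price = l_extendedprice × (1 − l_discount), price = disc_price × (1 + l_tax)) on one made-up sample row, using half-up rounding to two decimal places to match the decimal(10,2) cast:

    ```python
    from decimal import Decimal, ROUND_HALF_UP

    def derived_prices(extended_price, discount, tax):
        """Mirror the disc_price and price columns computed in the transform."""
        cents = Decimal("0.01")
        disc_price = (Decimal(extended_price) * (1 - Decimal(discount))).quantize(cents, ROUND_HALF_UP)
        price = (disc_price * (1 + Decimal(tax))).quantize(cents, ROUND_HALF_UP)
        return disc_price, price

    # Sample row: extended price 1000.00, 5% discount, 8% tax
    print(derived_prices("1000.00", "0.05", "0.08"))
    ```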

  18. On the Output schema tab, select Edit. Add the additional columns price and disc_price with the decimal data type. Select Apply.

    glue_output_schema_col.png
  19. On the Visual tab, in the workspace area, select Select from Collection. On the Transform tab, from the Node parents list, select Custom transform.

  20. From the Target list, select Amazon S3. On the Node properties tab, from the Node parents list, select Select from Collection.

  21. On the Data target properties - S3 tab, from the Format list, select Parquet.

  22. In the S3 Target Location box, enter s3://aws-glue-assets-{Your Account ID as a 12-digit number}-{AWS region}/output/ or select from the list.

    glue_s3_target_location.png
  23. Select Save > Run.