Load Data from AWS Glue
On this page
AWS Glue is a fully managed serverless data integration service that allows users to extract, transform, and load (ETL) data from various sources for analytics and data processing.
SingleStore provides a connector for AWS Glue based on Apache Spark Datasource, available through AWS Marketplace.
The following architecture diagram shows SingleStore connecting with AWS Glue for an ETL job.
Prerequisites
-
AWS Glue version 5.
0+. -
Admin access to the AWS account.
-
An active SingleStore workspace with a sample dataset.
The example uses the
lineitemtable from thetpchdatabase.Refer to Load TPC-H Data into SingleStore. Alternatively, you can use another data set stored in your SingleStore database.
Note
The SingleStore AWS Glue connector returns an error if used in the following AWS regions: Hong Kong, Sao Paolo, Stockholm, Bahrain, and Cape Town.
Configuration Overview
The following steps connect a SingleStore workspace in an AWS Glue ETL job as the source, transform the data, and store it back on the following:
-
SingleStore database
-
Amazon S3 in parquet format
To create an ETL job using the SingleStore connector:
-
Store authentication credentials in Secrets Manager.
-
Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job.
-
Configure the SingleStore connector and connection.
-
Create an ETL job using the SingleStore connection in AWS Glue Studio.
Store Authentication Credentials in Secrets Manager
AWS Glue provides integration with AWS Secrets Manager to securely store connection authentication credentials.
-
Log in to AWS, and open Secrets Manager (search Secrets Manager on the AWS console and select it from the results).
-
Select Store a new secret.
-
Under Secret type, select Other type of secrets.
-
Under Key/value pairs, set one row for each of the following parameters:
-
ddlEndpoint: IP address or hostname of the SingleStore workspace. -
database: Name of the SingleStore database to connect with. -
user: Username of the SingleStore database user with which to connect. -
password: Password for the SingleStore database user.
-
-
Select Next.
-
In the Secret name field, enter a name for the secret, for example,
aws-glue-singlestore-connection-info. -
Select Next.
-
Disable Automatic rotation, and then select the Next button.
-
Review the secret configuration and then select Store.
-
Copy and securely store the Secret ARN.
Create an IAM Role for the AWS Glue ETL Job
To create a role with an attached policy to allow read-only access to credentials that are stored in Secrets Manager for the AWS Glue ETL job:
-
Log in to AWS, and open IAM (search IAM in the AWS console and select it from the results).
-
Select Policies > Create Policy.
-
Select JSON.
On the JSON tab, enter the following JSON snippet (update the Region and account ID from the secret ARN): {"Version": "2012-10-17","Statement": [{"Sid": "VisualEditor0","Effect": "Allow","Action": ["secretsmanager:GetSecretValue","secretsmanager:DescribeSecret"],"Resource": "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:aws-glue-*"}]} -
Select Next.
-
Under Policy name, enter a name for the policy, for example, GlueAccessSecretValue.
-
Select Create policy.
-
On the IAM Dashboard, select Roles > Create role.
-
Under Trusted entity type, select AWS Service.
Under Use case, select Glue. -
Select Next.
-
Find and select the following AWS managed policies:
AWSGlueServiceRoleandAmazonEC2ContainerRegistryReadOnly. -
Find the policy created earlier
GlueAccessSecretValue, and select it. -
Select Next.
-
In the Role name field, enter a role name, for example, GlueCustomETLConnectionRole.
-
Confirm that the three policies are selected, and select Create role.
Configure the SingleStore Connector and Connection
To subscribe to the SingleStore connector and configure the connection:
-
On the AWS dashboard, select AWS Glue Studio (search Glue Studio on the AWS console and select it from the results).
-
In the left navigation pane, select Data Connections.
-
Select Go to AWS Marketplace.
On the AWS Marketplace, search and select SingleStore connector for AWS Glue. -
Select View purchase options > Subscribe.
-
Once the subscription request is successful, select View subscription > SingleStore connector for AWS Glue > Usage instructions.
-
On the Usage Instructions dialog, open the Activate the Glue connector using AWS Glue Studio link to activate the connector.
-
On the AWS Glue Studio console, select Data connections in the left navigation pane.
-
In the Connections section, select Create connection.
-
In the Create connection dialog, under Connection properties, enter a name for the connection in the Name field, for example
SingleStore_.connection -
From the AWS Secret list, select the AWS secret value
aws-glue-singlestore-connection-infocreated earlier. -
Select Create connection and activate connector.
Create an ETL Job using the SingleStore Connection in AWS Glue Studio
The example uses the lineitem table from the tpch database.
After authentication is set up and the SingleStore connector is configured, create an ETL job using the connection:
-
In the AWS Glue Studio console, select Data Connection in the left navigation pane.
-
In the Connections dialog, select SingleStore_
connection (the connection created earlier). Select Create job. -
On the Job details tab, in the Name field, enter a name for the job, for example, SingleStore_
transform_ job. -
In the Description field, enter a description, for example, Glue job to transform tpch data from SingleStore Helios.
-
From the IAM Role list, select GlueCustomETLConnectionRole.
-
From the Glue version list, select a Glue version.
-
Use default settings for other properties.
Select Save. -
On the Visual tab, in the workspace area, select the SingleStore connection.
On the Data source properties – Connector tab, expand Connection options, and select Add new option.
-
Under Connection options, in the Key field, enter dbtable.
In the Value field, enter lineitem. -
On the Output schema tab, select Edit.
Select the three dots, and select Add root key from the list. Note
In this example, AWS Glue Studio is using information stored in the connection to access the data source instead of retrieving metadata information from a Data Catalog table.
Hence, you must provide the schema metadata for the data source. Use the schema editor to update the source schema. For instructions on how to use the schema editor, refer to Editing the schema in a custom transform node. -
Add the key and data type, which represent the name of the column in the database and its data type, respectively.
Select Apply. -
On the Visual tab workspace area, select Add nodes, and then select DropFields from the list.
-
On the Transform tab (for Drop Fields), from the Node parents list, select SingleStore connector for AWS Glue.
-
Select the fields to drop.
-
On the Visual tab workspace area, select Add node, and then select Custom Transform from the list.
-
From the Node parents list, select Drop Fields.
-
Add the following script to Code block:
def MyTransform (glueContext, dfc) -> DynamicFrameCollection:from pyspark.sql.functions import coldf = dfc.select(list(dfc.keys())[0]).toDF().limit(100)df1 = df.withColumn("disc_price",(col("l_extendedprice")*(1-col("l_discount"))).cast("decimal(10,2)"))df2 = df1.withColumn("price", (col("disc_price")*(1+col("l_tax"))).cast("decimal(10,2)"))dyf = DynamicFrame.fromDF(df2, glueContext, "updated_lineitem")glueContext.write_dynamic_frame.from_options(frame = dyf,connection_type = "marketplace.spark",connection_options = {"dbtable":"updated_lineitem","connectionName":"SingleStore_connection"})return(DynamicFrameCollection({"CustomTransform0": dyf}, glueContext))This example calculates two additional columns,
disc_andprice price.It uses glueContext.to write the updated data back to SingleStore using the connection SingleStore_write_ dynamic_ frame connection created earlier. Note
You can specify any of the SingleStore Spark connector configuration settings in
connection_.options For example, the following statement (from the script) is extended to specify a shard key: glueContext.write_dynamic_frame.from_options(frame = dyf, connection_type = "marketplace.spark", connection_options = {"dbtable":"updated_lineitem","connectionName":"SingleStore_connection", "tableKey.shard" : "l_partkey"}) -
On the Output schema tab, select Edit.
Add the additional columns priceanddisc_with theprice decimaldata type.Select Apply. -
On the Visual tab, in the workspace area, select Add nodes > Select from Collection.
-
On the Transform tab, from the Node parents list, select Custom transform.
-
Select Add nodes.
From the Target list, select Amazon S3. -
From the Node parents list, select Select from Collection.
-
On the Data target properties - S3 tab, from the Format list, select Parquet.
-
In the S3 Target Location field, enter
s3://aws-glue-assets-{Your Account ID as a 12-digit number}-{AWS region}/output/or select from the list.
-
Select Save > Run.
The transformed data is now stored in Amazon S3 and the SingleStore database.
Last modified: