Load Data from Google Cloud Storage (GCS) Using a Pipeline
SingleStore Pipelines can extract objects from Google Cloud Storage, optionally transform them, and insert them into a destination table.
Prerequisites
The following prerequisites are needed to create a GCS pipeline.
- GCS Account: Requires a Google access_id and secret_key.
- SingleStore installation –or– a SingleStore cluster: You will connect to the database or cluster and create a pipeline to pull data from your GCS bucket.
Part 1: Creating a GCS Bucket and Adding a File
- On your local machine, create a text file with the following CSV contents and name it books.txt:
The Catcher in the Rye, J.D. Salinger, 1945
Pride and Prejudice, Jane Austen, 1813
Of Mice and Men, John Steinbeck, 1937
Frankenstein, Mary Shelley, 1818
- In GCS, create a bucket and upload books.txt to the bucket. Also, create an HMAC key for authentication to the bucket, as SingleStore Pipelines only support that type of authentication to GCS. For information on working with GCS, refer to the Google Cloud Storage Documentation.
Part 2: Creating a SingleStore Database and GCS Pipeline
Now that you have a GCS bucket that contains an object (file), you can use SingleStore to create a new pipeline and ingest the data.
Create a new database and a table that adheres to the schema contained in the books.txt file.
CREATE DATABASE books;
CREATE TABLE classic_books (
  title VARCHAR(255),
  author VARCHAR(255),
  date VARCHAR(255)
);
These statements create a new database named books and a new table named classic_books, which has three columns: title, author, and date.
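To confirm that the table has the expected schema, you can describe it; DESCRIBE is standard MySQL-compatible syntax that SingleStore supports:
DESCRIBE classic_books;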
Now that the destination database and table have been created, you can create a GCS pipeline. To do so, you will need the following information:
- The name of the bucket, such as: my-bucket-name
- Your Google account’s HMAC access keys, such as:
  Access Key ID: your_access_key_id
  Secret Access Key: your_secret_access_key
Using these identifiers and keys, execute the following statement, replacing the placeholder values with your own:
CREATE PIPELINE library
AS LOAD DATA GCS 'my-bucket-name'
CREDENTIALS '{"access_id": "your_access_key_id", "secret_key": "your_secret_access_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';
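If the bucket also contains objects you do not want to ingest, you can usually scope a pipeline to part of the bucket by appending a path to the bucket name. The following is a sketch, not part of this tutorial; the pipeline name library_books_only and the books/ prefix are hypothetical placeholders:
-- Hypothetical variant: only load objects whose names start with the books/ prefix.
CREATE PIPELINE library_books_only
AS LOAD DATA GCS 'my-bucket-name/books/'
CREDENTIALS '{"access_id": "your_access_key_id", "secret_key": "your_secret_access_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';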
You can see what files the pipeline wants to load by running the following:
SELECT * FROM information_schema.PIPELINES_FILES;
If everything is properly configured, you should see one row in the Unloaded state, corresponding to books.txt. The CREATE PIPELINE statement creates a new pipeline named library, but the pipeline has not yet been started, and no data has been loaded. A pipeline can be run in the foreground or in the background.
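If your database contains more than one pipeline, you can narrow the metadata down to this pipeline by filtering on its name. This sketch assumes the PIPELINE_NAME, FILE_NAME, and FILE_STATE columns of the information_schema.PIPELINES_FILES view:
SELECT FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'library';
Start the pipeline in the foreground first: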
START PIPELINE library FOREGROUND;
When this command returns success, all files from your bucket will be loaded. If you check information_schema.PIPELINES_FILES again, you should see all files in the Loaded state. Now query the classic_books table to make sure the data has actually loaded:
SELECT * FROM classic_books;
+------------------------+-----------------+-------+
| title | author | date |
+------------------------+-----------------+-------+
| The Catcher in the Rye | J.D. Salinger | 1945 |
| Pride and Prejudice | Jane Austen | 1813 |
| Of Mice and Men | John Steinbeck | 1937 |
| Frankenstein | Mary Shelley | 1818 |
+------------------------+-----------------+-------+
You can also have SingleStore run your pipeline in the background.
DELETE FROM classic_books;
ALTER PIPELINE library SET OFFSETS EARLIEST;
The first command deletes all rows from the target table. The second resets the pipeline's offsets so that it forgets it already loaded books.txt, and you can load the file again.
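As a sanity check before restarting, you can re-run the metadata query from earlier; after the offsets reset, books.txt should appear in the Unloaded state again:
SELECT FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'library';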
To start a pipeline in the background, run
START PIPELINE library;
This statement starts the pipeline in the background. To verify that the pipeline is running, run SHOW PIPELINES.
SHOW PIPELINES;
+----------------------+---------+
| Pipelines_in_books | State |
+----------------------+---------+
| library | Running |
+----------------------+---------+
At this point, the pipeline is running and the contents of the books.txt file should once again be present in the classic_books table.
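To confirm the reload, a simple count against the target table should return the four rows from the sample file:
SELECT COUNT(*) FROM classic_books;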
Note
Foreground pipelines and background pipelines have different intended uses and behave differently: a foreground pipeline loads the data that is available when it is started and then stops, while a background pipeline keeps running and picks up new files as they arrive in the bucket.
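When you are finished, you can stop the background pipeline, and optionally remove it altogether:
STOP PIPELINE library;
-- Remove the pipeline entirely (optional):
DROP PIPELINE library;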
Related Topics
See Load Data with Pipelines to learn more about how pipelines work.