Load Data from Google Cloud Storage (GCS) Using a Pipeline
SingleStore Pipelines can extract objects from Google Cloud Storage, optionally transform them, and insert them into a destination table.
Prerequisites
The following prerequisites are needed to create a GCS pipeline.
- GCS Account: Requires a Google access_id and secret_key.
- SingleStore installation –or– a SingleStore cluster: You will connect to the database or cluster and create a pipeline to pull data from your GCS bucket.
Part 1: Creating a GCS Bucket and Adding a File
- On your local machine, create a text file with the following CSV contents and name it books.txt:
The Catcher in the Rye, J.D. Salinger, 1945
Pride and Prejudice, Jane Austen, 1813
Of Mice and Men, John Steinbeck, 1937
Frankenstein, Mary Shelley, 1818
- In GCS, create a bucket and upload books.txt to the bucket. Also, create an HMAC key for authentication to the bucket, as SingleStore Pipelines only support that type of authentication to GCS. For information on working with GCS, refer to the Google Cloud Storage Documentation.
Part 2: Creating a SingleStore Database and GCS Pipeline
Now that you have a GCS bucket that contains an object (file), you can use SingleStore to create a new pipeline and ingest the data.
Create a new database and a table that adheres to the schema contained in the books.txt file.
CREATE DATABASE books;
CREATE TABLE classic_books (
  title VARCHAR(255),
  author VARCHAR(255),
  date VARCHAR(255)
);
These statements create a new database named books and a new table named classic_books, which has three columns: title, author, and date.
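To confirm that the table has the expected schema, you can describe it; DESCRIBE is standard MySQL-compatible syntax that SingleStore supports:
DESCRIBE classic_books;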
Now that the destination database and table have been created, you can create a GCS pipeline. To do so, you will need the following information:
- The name of the bucket, such as: my-bucket-name
- Your Google account’s HMAC access keys, such as:
  Access Key ID: your_access_key_id
  Secret Access Key: your_secret_access_key
Using these identifiers and keys, execute the following statement, replacing the placeholder values with your own:
CREATE PIPELINE library
AS LOAD DATA GCS 'my-bucket-name'
CREDENTIALS '{"access_id": "your_access_key_id", "secret_key": "your_secret_access_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';
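If the bucket also contains objects you do not want to ingest, you can usually scope a pipeline to part of the bucket by appending a path to the bucket name. The following is a sketch, not part of this tutorial; the pipeline name library_books_only and the books/ prefix are hypothetical placeholders:
-- Hypothetical variant: only load objects whose names start with the books/ prefix.
CREATE PIPELINE library_books_only
AS LOAD DATA GCS 'my-bucket-name/books/'
CREDENTIALS '{"access_id": "your_access_key_id", "secret_key": "your_secret_access_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';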
You can see what files the pipeline wants to load by running the following:
SELECT * FROM information_schema.PIPELINES_FILES;
If everything is properly configured, you should see one row in the Unloaded state, corresponding to books.txt. The CREATE PIPELINE statement creates a new pipeline named library, but the pipeline has not yet been started, and no data has been loaded. A pipeline can be run in the foreground or in the background.
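If your database contains more than one pipeline, you can narrow the metadata down to this pipeline by filtering on its name. This sketch assumes the PIPELINE_NAME, FILE_NAME, and FILE_STATE columns of the information_schema.PIPELINES_FILES view:
SELECT FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'library';
Start the pipeline in the foreground first: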
START PIPELINE library FOREGROUND;
When this command returns success, all files from your bucket will be loaded. If you check information_schema.PIPELINES_FILES again, you should see all files in the Loaded state. Now query the classic_books table to make sure the data has actually loaded:
SELECT * FROM classic_books;
+------------------------+-----------------+-------+
| title | author | date |
+------------------------+-----------------+-------+
| The Catcher in the Rye | J.D. Salinger | 1945 |
| Pride and Prejudice | Jane Austen | 1813 |
| Of Mice and Men | John Steinbeck | 1937 |
| Frankenstein | Mary Shelley | 1818 |
+------------------------+-----------------+-------+
You can also have SingleStore run your pipeline in the background.
DELETE FROM classic_books;
ALTER PIPELINE library SET OFFSETS EARLIEST;
The first command deletes all rows from the target table. The second resets the pipeline's offsets so that it forgets it already loaded books.txt, and you can load the file again.
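As a sanity check before restarting, you can re-run the metadata query from earlier; after the offsets reset, books.txt should appear in the Unloaded state again:
SELECT FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'library';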
To start a pipeline in the background, run
START PIPELINE library;
This statement starts the pipeline in the background. To verify that the pipeline is running, run SHOW PIPELINES.
SHOW PIPELINES;
+----------------------+---------+
| Pipelines_in_books | State |
+----------------------+---------+
| library | Running |
+----------------------+---------+
At this point, the pipeline is running and the contents of the books.txt file should once again be present in the classic_books table.
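To confirm the reload, a simple count against the target table should return the four rows from the sample file:
SELECT COUNT(*) FROM classic_books;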
Note
Foreground pipelines and background pipelines have different intended uses and behave differently: a foreground pipeline loads the data that is available when it is started and then stops, while a background pipeline keeps running and picks up new files as they arrive in the bucket.
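When you are finished, you can stop the background pipeline, and optionally remove it altogether:
STOP PIPELINE library;
-- Remove the pipeline entirely (optional):
DROP PIPELINE library;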
Related Topics
See Load Data with Pipelines to learn more about how pipelines work.