Load Data from the Filesystem Using a Pipeline

Note

The filesystem pipelines feature is not supported by the SingleStore (MemSQL) Operator or the SingleStore database engine in Kubernetes environments.

Prerequisites

To load data from the filesystem using a pipeline, you need at minimum:

  • Filesystem access to obtain the data.

  • A SingleStore database.

  • A table with corresponding columns based on the data to be loaded.

Create a Database and a Filesystem Pipeline

The following sections walk through creating a local directory, a text file, a database, a table, and a pipeline for ingesting data.

  1. In a terminal window, create a directory and change to the created directory. In this step the directory is called test_directory.

    mkdir test_directory
    cd test_directory

    Note

    The directory you create will need to be in the path specified by the secure_file_priv engine variable. Import and export operations are only allowed on files in the directory specified by secure_file_priv or its subdirectories.
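    You can check the current value of secure_file_priv from a SingleStore prompt. The following uses the standard MySQL-compatible variable syntax; an empty value typically means no restriction:

    ```sql
    -- Show the directory that file import/export operations are restricted to
    SELECT @@GLOBAL.secure_file_priv;
    ```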

  2. Create a file called books.txt and paste in the data below. Press Ctrl-D to save the file.

    cat > books.txt
    The Catcher in the Rye, J.D. Salinger, 1945
    Pride and Prejudice, Jane Austen, 1813
    Of Mice and Men, John Steinbeck, 1937
    Frankenstein, Mary Shelley, 1818
    Ctrl-D
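    If you prefer a non-interactive alternative to the cat session above, the same file can be created with printf. This sketch is equivalent to step 2:

    ```shell
    # Create books.txt without an interactive cat session
    printf '%s\n' \
      "The Catcher in the Rye, J.D. Salinger, 1945" \
      "Pride and Prejudice, Jane Austen, 1813" \
      "Of Mice and Men, John Steinbeck, 1937" \
      "Frankenstein, Mary Shelley, 1818" > books.txt

    # Confirm the file has four lines
    wc -l books.txt
    ```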
  3. This step and all remaining steps are run from a SingleStore prompt. Create a database, then switch to it.

    CREATE DATABASE books;
    USE books;
  4. Create a table named classic_books with three columns: title, author, and date.

    CREATE TABLE classic_books
    (
    title VARCHAR(255),
    author VARCHAR(255),
    date VARCHAR(255)
    );
  5. Now that the destination database and table exist, create the pipeline. Replace the path below with the absolute path to the test_directory created earlier:

    CREATE PIPELINE library AS
    LOAD DATA FS '/absolute/path/to/test_directory'
    INTO TABLE classic_books
    FIELDS TERMINATED BY ',';

    You can see what files the pipeline will attempt to load by running the following:

    SELECT * FROM information_schema.PIPELINES_FILES;
    +---------------+---------------+-------------+-------------------------+-----------+------------+
    | DATABASE_NAME | PIPELINE_NAME | SOURCE_TYPE | FILE_NAME               | FILE_SIZE | FILE_STATE |
    +---------------+---------------+-------------+-------------------------+-----------+------------+
    | books         | library       | FS          | /tmp/books.txt          |       154 | Unloaded   |
    +---------------+---------------+-------------+-------------------------+-----------+------------+
  6. The CREATE PIPELINE statement creates a new pipeline named library, but the pipeline has not yet been started and no data has been loaded. If properly configured, information_schema.PIPELINES_FILES shows one row in the Unloaded state for the books.txt file.

    A SingleStore pipeline can run either in the background or foreground. Starting the pipeline in the foreground allows you to see any errors that may occur during the execution. Starting the pipeline in the background allows continuous data ingest into the specified table.

    START PIPELINE library FOREGROUND;
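    If the foreground start reports errors, per-file error details can typically be inspected through the information_schema.PIPELINES_ERRORS table (assuming it is available in your version); for example:

    ```sql
    SELECT DATABASE_NAME, PIPELINE_NAME, ERROR_MESSAGE
    FROM information_schema.PIPELINES_ERRORS;
    ```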
  7. Once the pipeline has started, the contents of the books.txt file should be loaded. Check information_schema.PIPELINES_FILES again; all files should now be in the Loaded state. Query the classic_books table to verify the data has loaded.

    SELECT * FROM classic_books;
    +------------------------+-----------------+-------+
    | title                  | author          | date  |
    +------------------------+-----------------+-------+
    | The Catcher in the Rye |  J.D. Salinger  |  1945 |
    | Pride and Prejudice    |  Jane Austen    |  1813 |
    | Of Mice and Men        |  John Steinbeck |  1937 |
    | Frankenstein           |  Mary Shelley   |  1818 |
    +------------------------+-----------------+-------+
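    For example, to confirm the file state for this pipeline specifically (using the column names shown in step 5):

    ```sql
    SELECT FILE_NAME, FILE_STATE
    FROM information_schema.PIPELINES_FILES
    WHERE PIPELINE_NAME = 'library';
    ```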

Running a pipeline in the background causes the pipeline to periodically poll the directory for new files and continuously load them into the table. Before running a pipeline in the background, reset the pipeline and the table.

The first statement deletes all rows from the target table. The second resets the pipeline's offsets so that it starts from the beginning; in this example, the pipeline forgets that it already loaded books.txt. An alternative is to drop and recreate the pipeline.

DELETE FROM classic_books;
ALTER PIPELINE library SET OFFSETS EARLIEST;
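The drop-and-recreate alternative mentioned above would look like the following; the FS path is a placeholder, as in step 5:

```sql
DROP PIPELINE library;
CREATE PIPELINE library AS
LOAD DATA FS '/absolute/path/to/test_directory'
INTO TABLE classic_books
FIELDS TERMINATED BY ',';
```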

To start a pipeline in the background, use:

START PIPELINE library;

To verify that the pipeline is running, run SHOW PIPELINES.

SHOW PIPELINES;
+--------------------+---------+-----------+
| Pipelines_in_books | State   | Scheduled |
+--------------------+---------+-----------+
| library            | Running | False     |
+--------------------+---------+-----------+

At this point, the pipeline is running and the contents of the books.txt file should once again be present in the classic_books table.
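When continuous ingest is no longer needed, the background pipeline can be stopped, and optionally dropped:

```sql
STOP PIPELINE library;
-- To remove the pipeline entirely:
-- DROP PIPELINE library;
```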

Note

Foreground pipelines and background pipelines have different intended uses and behave differently. For more information, see START PIPELINE.

Next Steps

See Load Data with Pipelines to learn more about how pipelines work.

Last modified: January 23, 2024
