# Load Data from HDFS Using a Pipeline

When you use HDFS (A distributed file system by Apache Hadoop. It is highly fault tolerant and designed to run on COTS (Commercial Off the Shelf) or low cost out-of-the-box hardware.) pipelines, you extract data from an HDFS file path, optionally transform the data, and load it to a SingleStore table.

With [Enabling Wire Encryption and Kerberos on HDFS Pipelines](https://docs.singlestore.com/db/v9.1/load-data/data-sources/load-data-from-hdfs-using-a-pipeline/enabling-wire-encryption-and-kerberos-on-hdfs-pipelines.md), you can encrypt your pipeline’s connection to HDFS and you can authenticate your pipeline using Kerberos. SingleStore supports Data Transfer Protocol (DTP), which encrypts your pipeline’s connection to HDFS.

This topic assumes that you have set up HDFS in your cluster and you are familiar with [how HDFS works](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html).

> **📝 Note**: HDFS Pipelines cannot run Hadoop jobs.

## Creating and Starting an HDFS Pipeline

You create an HDFS pipeline by running the [CREATE PIPELINE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/create-pipeline.md) statement. You start a pipeline by running [START PIPELINE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/start-pipeline.md)

## HDFS Pipeline Scenario

Imagine that your organization has numerous applications running on-premises. These applications generate lengthy log files that contain possible errors the applications generate as they run.

Your goal: Get a running weekly and monthly count of the number of errors per application.

To accomplish this, you follow these steps sequentially. The steps are explained in detail in the sections below.

1. Copy the first day’s application log files into HDFS.

2. Run a Hadoop job that processes the log files and outputs the results to one output file.

3. Use an HDFS pipeline to extract the results and import them into a SingleStore table.

4. Confirm that the SingleStore table contains the data for the first day. If the data is correct, run steps one through three continuously, at the end of every day.

> **📝 Note**: This scenario provides a hypothetical example of how you can use HDFS pipelines. It is a not a working example, as the application logs and the Hadoop job are not available.

## Log File Format

Shown below is an example application log file containing selected errors from one day. The `...` parts indicate lengthy sections containing errors that are removed for brevity.

```
...
[2019-03-01 09:30:08]: ERR-2030: File not found: file1.txt
...
[2019-03-01 12:15:35]: ERR-1010: Could not read configuration file conf_file.txt
...
[2019-03-01 14:00:10]: ERR-1520: Not enough memory available to open file file2.txt
...
[2019-03-01 16:40:35]: ERR-1010: Could not read configuration file conf_file10.txt
...
[2019-03-01 19:20:55]: ERR-1520: Not enough memory available to open file file15.txt
...

```

Your other applications generate log files that are formatted in a similar way. The errors contain the date and time of the error, the error code and the error message.

## HDFS folder Setup

The HDFS folder `/user/memsql/<application name>` stores the input files, per application, that are processed by your Hadoop job. After your job runs, it outputs the resulting `part-00000` file to the HDFS folder `/user/memsql/output`. The `part-00000` filename has the current date appended to the end.

## Running the Hadoop Job

Run the Hadoop job which extracts the error date and error code for each log entry. The job will write these fields, along with the application name, to a line in the output file.

An example output file is shown below.

```
App1, ERR-2030, 2019-03-01
App1, ERR-1010, 2019-03-01
App1, ERR-1520, 2019-03-01
App1, ERR-1010, 2019-03-01
App1, ERR-1520, 2019-03-01
App2, E-400, 2019-03-01
App2, E-250, 2019-03-01
App2, E-800, 2019-03-01
App2, E-400, 2019-03-01

```

## Creating the Destination Table

Create the table where the data will be loaded from the pipeline.

```sql
CREATE TABLE app_errors (app_name TEXT, error_code TEXT, error_date DATE, SORT KEY (error_date));

```

The reason you use a columnstore table to store the application errors is because a columnstore is well suited for performing aggregate queries. Also, you will not be updating rows in the table once they are imported.

## Creating Your Pipeline

Use the following statement to create a new pipeline named `my_pipeline`, where you reference the HDFS path `/memsql/output/` as the data source, and import data into the `app_errors` table once the pipeline is started.

```sql
CREATE PIPELINE my_pipeline AS
LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/memsql/output/'
INTO TABLE app_errors
FIELDS TERMINATED BY ', ';

```

Should your `CREATE PIPELINE` statement fail, run `SHOW WARNINGS`. This command may provide information as to why the `CREATE PIPELINE` statement failed.

## Testing Your Pipeline

After you run your `CREATE PIPELINE` statement successfully, run the tests in this section to confirm your pipeline is working.

The following query should return one row for each file that the pipeline has imported. In this case, the query will only return one row, since the Hadoop job generates one output file. For example, if you ran the job on January 1, 2019, `FILE_NAME` could have the value `part-00000-20190101`. The `FILE_STATE` field value should have the value `Unloaded`.

```sql
SELECT * FROM information_schema.pipelines_files WHERE pipeline_name = 'my_pipeline';

```

Run the following command to test if the master aggregator can connect to the HDFS namenode. If the test is successful, zero rows will be returned.

```sql
TEST PIPELINE my_pipeline LIMIT 0;

```

Run the following command to test if the SingleStore leaf nodes can connect to the HDFS datanodes. If the test is successful, one row will be returned.

```sql
TEST PIPELINE my_pipeline LIMIT 1;

```

## Starting Your Pipeline

Assuming the tests you ran in the previous section succeeded, start your pipeline in the foreground:

```sql
START PIPELINE my_pipeline FOREGROUND LIMIT 1 BATCHES;

```

Starting your pipeline in the foreground allows you to see any errors, if they occur. After your pipelines ingests one batch of data into the `app_errors` table, your pipeline stops.

Run the following query, which should return one row with the `part-00000-20190101` file. In this row, the `FILE_STATE` field should have the value `Loaded`, assuming no errors occurred when you started the pipeline.

```sql
SELECT * FROM information_schema.pipelines_files WHERE pipeline_name = 'my_pipeline';

```

Run `SELECT * FROM app_errors;` to view the records in the `app_errors` table. If your HDFS file `/user/memsql/output/part-00000-20190101` contains the output as shown in the [Running the Hadoop Job](https://docs.singlestore.com/db/v9.1/load-data/data-sources/load-data-from-hdfs-using-a-pipeline/#UUID-5d36d75c-39cb-cb2d-2646-c6dfcce4dcfb.md) above, you will see nine records, each with an `app_name`, `error_code` and `error_date` field.

## Syncing HDFS with Your Application Logs

Now that your Hadoop job generates an output file successfully and your pipeline imports the file successfully, you want to periodically copy your application log files to the `/user/memsql/<application name>` HDFS folders. To accomplish this, you write a script to automatically copy the files at the end of every day.

After you start the script, start your pipeline in the background to continuously ingest data into the `app_errors` table.

```sql
START PIPELINE my_pipeline;

```

> **📝 Note**: Foreground pipelines and background pipelines have different intended uses and behave differently. For more information, see [START PIPELINE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/start-pipeline.md).

## Finding the Weekly and Monthly Error Count

Recall that your goal is to find a running weekly and monthly count of the number of errors each application generates. Prior to creating a weekly and monthly query that returns these counts, you write a simpler query that returns the number of errors per application and error code, for all records in the table.

```sql
SELECT COUNT(*), app_name, error_code FROM app_errors
GROUP BY app_name, error_code

```

Using the previous query as a starting point, you write the monthly query:

```sql
SELECT COUNT(*), app_name, error_code, MONTH(error_date) FROM app_errors
GROUP BY app_name, error_code, MONTH(error_date)

```

Finally, you write the weekly query, using `WEEK(error_date, 2)` to specify that the week begins on Sunday:

```sql
SELECT COUNT(*), app_name, error_code, WEEK(error_date, 2) FROM app_errors
GROUP BY app_name, error_code, WEEK(error_date, 2)

```

## Next Steps

See [About SingleStore Pipelines](https://docs.singlestore.com/db/v9.1/load-data/about-singlestore-pipelines.md) to learn more about how pipelines work.

## In this section

* [Enabling Wire Encryption and Kerberos on HDFS Pipelines](https://docs.singlestore.com/db/v9.1/load-data/data-sources/load-data-from-hdfs-using-a-pipeline/enabling-wire-encryption-and-kerberos-on-hdfs-pipelines.md)

***

Modified at: July 24, 2025

Source: [/db/v9.1/load-data/data-sources/load-data-from-hdfs-using-a-pipeline/](https://docs.singlestore.com/db/v9.1/load-data/data-sources/load-data-from-hdfs-using-a-pipeline/)

(An index of the documentation is available at /llms.txt)
