# Enabling Wire Encryption and Kerberos on HDFS Pipelines

HDFS clusters can be configured in multiple ways. The recommended approach is to first configure the Hadoop CLI client to access the cluster’s files, and then transfer that working configuration into the SingleStore pipeline configuration.

In advanced HDFS pipelines mode (which can be enabled using the `advanced_hdfs_pipelines` global variable), the standard Hadoop client library is used, and Hadoop configurations from CONFIG JSON are passed to it. The configuration that works for the Hadoop CLI client works for HDFS pipelines with a few modifications.

In advanced HDFS pipelines mode, you can encrypt your pipeline’s connection to HDFS and authenticate your pipeline using Kerberos. SingleStore supports Hadoop’s Data Transfer Protocol (DTP), which encrypts your pipeline’s connection to HDFS.

This topic assumes you have already set up your HDFS cluster to use wire encryption and/or Kerberos. For information on how to set up wire encryption, see the DTP section in the [Hadoop Secure Mode documentation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html#Data_Encryption_on_Block_data_transfer.). For information on how to set up your HDFS cluster to use Kerberos, see the Kerberos discussion in the [Hadoop Secure Mode documentation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html).

To create an advanced HDFS pipeline, first set the `advanced_hdfs_pipelines` [engine variable](https://docs.singlestore.com/db/v9.1/reference/configuration-reference/engine-variables.md) to `true` on the master aggregator. Then, run a `CREATE PIPELINE` statement and pass in JSON attributes in the `CONFIG` clause. These attributes specify how to encrypt your pipeline’s connection to HDFS, how to authenticate your pipeline using Kerberos, or both.
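
As a minimal sketch, the variable can be set and then checked from a SQL session on the master aggregator. The `SET GLOBAL` and `SHOW GLOBAL VARIABLES` forms below are assumed to be the way engine variables are changed and inspected in your version; confirm them against the engine variables reference.

```sql
-- Run on the master aggregator.
-- Turn on advanced HDFS pipelines mode.
SET GLOBAL advanced_hdfs_pipelines = true;

-- Confirm the current value.
SHOW GLOBAL VARIABLES LIKE 'advanced_hdfs_pipelines';
```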

> **📝 Note**: With advanced HDFS pipelines, you can enable debug logging. To do so, set the `pipelines_extractor_debug_logging` engine sync variable to `true`. This setting allows your pipeline to return error messages to the client application.
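
A sketch of turning debug logging on, assuming the variable is set with `SET GLOBAL` on the master aggregator like other sync variables:

```sql
-- Return extractor error messages to the client while troubleshooting.
SET GLOBAL pipelines_extractor_debug_logging = true;

-- Once troubleshooting is finished, the setting can be reverted.
SET GLOBAL pipelines_extractor_debug_logging = false;
```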

## Using Kerberos Cache Support

The Kerberos cache file, also referred to as the credential cache or ticket cache, is a file that stores the Kerberos authentication tickets obtained by a client during the authentication process. This cache file is typically used by Kerberos clients to store and manage the client's credentials for future authentication requests.

SingleStore supports using the Kerberos cache to reduce the number of requests to the Kerberos servers. It uses the `kinit` utility to create the cache, and then uses the cache for authentication. By default, the cache is created in the `/tmp/s2_krb5cc` file; this location can be changed in the pipeline configuration. SingleStore can also reuse Hadoop logins, which further reduces the number of requests.

Begin by setting `advanced_hdfs_pipelines` to `ON`, which is required to access Kerberized HDFS.

To enable reusing Hadoop logins, add the following lines to the CONFIG section JSON:

```json
"allow_unknown_configs": true,
"kerberos.allow_login_reuse": true
```

Login reuse works best when all HDFS pipelines use the same configuration.
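
For illustration, the two attributes might sit alongside the Kerberos attributes in a complete statement like the sketch below. The pipeline name, table name, HDFS URL, principal, and keytab path are placeholders borrowed from the Kerberos example later in this topic, not values to copy as-is.

```sql
-- Placeholder names and paths; substitute your own.
CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/path/to/files'
CONFIG '{
    "hadoop.security.authentication": "kerberos",
    "kerberos.user": "memsql/host.example.com@EXAMPLE.COM",
    "kerberos.keytab": "/path/to/kerberos.keytab",
    "allow_unknown_configs": true,
    "kerberos.allow_login_reuse": true
}'
INTO TABLE `my_table`
FIELDS TERMINATED BY '\t';
```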

To modify the configuration of an existing pipeline, edit its [CREATE PIPELINE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/create-pipeline.md) syntax. Unlike creating a new pipeline, this keeps the information about files that have already been ingested.

Setting up caching requires a few additional steps:

1. Install the Kerberos CLI client tools on all nodes.

2. On each node, set up the `/etc/krb5.conf` file for the Kerberos server that is associated with the Hadoop cluster. Verify that the `kinit -kt KEYTAB_FILE HADOOP_USER` command succeeds with the keytab file and the Hadoop user used in the HDFS pipeline.

3. Modify the pipeline configuration to include:
   ```json
   "allow_unknown_configs": true,
   "kerberos.use_cache": true
   ```

Alternatively, a [CREATE PIPELINE](https://docs.singlestore.com/db/v9.1/reference/sql-reference/pipelines-commands/create-pipeline.md) statement may be used with the same definition as the initial `CREATE` but with a modified `CONFIG` section.

Some additional settings are available for Kerberos cache:

* `"kerberos.kinit_path"`: Can be used to set the path to the `kinit` executable, if it is not in `$PATH`.
* `"kerberos.cache_path"`: Path for the Kerberos cache, which is `/tmp/s2_krb5cc` by default.
* `"kerberos.cache_renewal_interval"`: Renewal interval for the cache, in seconds. The default interval is one hour. When the cache is older than this, new logins are performed via keytab.
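
The cache settings above could be combined in one statement along the lines of the following sketch. It assumes that the `CREATE OR REPLACE PIPELINE` form is available in your version for redefining an existing pipeline, that the renewal interval is given as a JSON number, and that `kinit` lives at the hypothetical path `/usr/bin/kinit`. The remaining names and paths are placeholders.

```sql
-- Placeholder names and paths; substitute your own.
-- cache_path and cache_renewal_interval are shown with their documented
-- defaults (/tmp/s2_krb5cc and one hour = 3600 seconds).
CREATE OR REPLACE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/path/to/files'
CONFIG '{
    "hadoop.security.authentication": "kerberos",
    "kerberos.user": "memsql/host.example.com@EXAMPLE.COM",
    "kerberos.keytab": "/path/to/kerberos.keytab",
    "allow_unknown_configs": true,
    "kerberos.use_cache": true,
    "kerberos.kinit_path": "/usr/bin/kinit",
    "kerberos.cache_path": "/tmp/s2_krb5cc",
    "kerberos.cache_renewal_interval": 3600
}'
INTO TABLE `my_table`
FIELDS TERMINATED BY '\t';
```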

## Wire Encryption

If encrypted DTP is enabled in your HDFS cluster, you can encrypt your pipeline’s connection to HDFS. To do this, build the `CONFIG` JSON for your `CREATE PIPELINE` statement as follows:

1. Set `dfs.encrypt.data.transfer` to `true`.

2. Set the attributes `dfs.encrypt.data.transfer.cipher.key.bitlength`, `dfs.encrypt.data.transfer.algorithm`, and `dfs.data.transfer.protection` to the values specified in your `hdfs-site.xml` file. A copy of this file is found on each node in your HDFS cluster.

The following example creates a pipeline that uses encrypted DTP to communicate with HDFS.

```sql
CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/path/to/files'
CONFIG '{
    "dfs.encrypt.data.transfer": true,
    "dfs.encrypt.data.transfer.cipher.key.bitlength": 256,
    "dfs.encrypt.data.transfer.algorithm": "rc4",
    "dfs.data.transfer.protection": "authentication"
}'
INTO TABLE `my_table`
FIELDS TERMINATED BY '\t';
```

## Authenticating with Kerberos

You can create an HDFS pipeline that authenticates with Kerberos. Prior to doing so, perform the following installation steps on every SingleStore leaf node. These steps use `EXAMPLE.COM` as the default realm and `host.example.com` as the fully qualified domain name (FQDN) of the KDC server.

> **📝 Note**: Perform the following steps on every SingleStore leaf node (referred to below as the “node”). The exception is step 3, which is performed only on the KDC server.

1. Install version 1.8 or later of the Java Runtime Environment (JRE). The JRE version installed should match the JRE version installed on the HDFS nodes.

2. Tell SingleStore the path where the JRE binary files have been installed. An example path is `/usr/bin/java/jre1.8.2_12/bin`. Specify the path using one of the two following methods:

   *Method 1:* Add the path to your operating system’s `PATH` environment variable.

   *Method 2:* Set the engine variables `java_pipelines_java_path` and `java_pipelines_java_home` to the path.

3. On the KDC server, create a SingleStore service principal (e.g. `memsql/host.example.com@EXAMPLE.COM`) and a keytab file containing the SingleStore service principal.

4. Securely copy the keytab file containing the SingleStore service principal from the KDC server to the node. You should use a secure file transfer method, such as `scp`, to copy the keytab file to your node. The file location on your node should be consistent across all nodes in the cluster.

5. Ensure that the Linux service account used to run SingleStore on the node can access the copied keytab file. This can be accomplished by changing file ownership or permissions. If this account cannot access the keytab file, you will not be able to complete the next step because your master aggregator will not be able to restart after applying configuration updates.

6. When authenticating with Kerberos, SingleStore needs to authenticate as a client, which means you must also install a Kerberos client on your node.

   The following command installs the client on Debian-based Linux distributions.
   ```shell
   sudo apt-get update && sudo apt-get install krb5-user
   ```
   The following command installs the client on RHEL/CentOS:
   ```shell
   sudo yum install krb5-workstation
   ```

7. Configure your Kerberos client to connect to the KDC server. In your node’s `/etc/krb5.conf` file, set your default realm, Kerberos admin server, and other options to those defined by your KDC server.

8. Make sure your node can connect to the KDC server using the fully-qualified domain name (FQDN) of the KDC server. This FQDN is found in the `/etc/krb5.conf` file. This might require configuring network settings or updating `/etc/hosts` on your node.

9. Ensure that your node can access every HDFS datanode, using the FQDN or IP by which the HDFS namenode accesses the datanode. The FQDN is typically used.

10. Specify the path of your keytab file in the `kerberos.keytab` attribute of your `CONFIG` JSON that you will pass to your `CREATE PIPELINE` statement.

11. In your `CONFIG` JSON, add the attributes `dfs.datanode.kerberos.principal` and `dfs.namenode.kerberos.principal`. Set these attributes to the values specified in your `hdfs-site.xml` file. A copy of this file is found on each node in your HDFS cluster.

## Hadoop RPC Protection

To securely access a Kerberized Hadoop cluster and load data from HDFS, set `hadoop.rpc.protection` in your `CREATE PIPELINE` configuration to match the quality of protection (QOP) settings specified in your `core-site.xml` file. A copy of this file is found on each node in your HDFS cluster. Example values: `"hadoop.rpc.protection": "authentication,privacy"` or `"hadoop.rpc.protection": "privacy"`. See the [example CREATE PIPELINE statement](https://docs.singlestore.com/#UUID-02e7b3da-562d-914d-4ee6-c40daac9561b.md) for syntax.

## Example `CREATE PIPELINE` Statement Using Kerberos

The following example demonstrates how to create an HDFS pipeline that authenticates using Kerberos. Assume that port 8020 is the HDFS endpoint.

```sql
CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/path/to/files'
CONFIG '{
    "hadoop.security.authentication": "kerberos",
    "kerberos.user": "memsql/host.example.com@EXAMPLE.COM",
    "kerberos.keytab": "/path/to/kerberos.keytab",
    "hadoop.rpc.protection": "authentication,privacy",
    "dfs.client.use.datanode.hostname": true,
    "dfs.datanode.kerberos.principal": "datanode_principal/_HOST@EXAMPLE.COM",
    "dfs.namenode.kerberos.principal": "namenode_principal/_HOST@EXAMPLE.COM"
}'
INTO TABLE `my_table`
FIELDS TERMINATED BY '\t';
```
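
Because the `CONFIG` attributes can specify wire encryption, Kerberos authentication, or both, the attributes from the two examples in this topic can be merged into a single `CONFIG`. The following is a sketch of that combination rather than a tested configuration: the pipeline name `my_secure_pipeline` is hypothetical, and every value must match what your cluster defines in `hdfs-site.xml` and `core-site.xml`.

```sql
-- Placeholder names, paths, and principals; substitute your own.
CREATE PIPELINE my_secure_pipeline
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/path/to/files'
CONFIG '{
    "hadoop.security.authentication": "kerberos",
    "kerberos.user": "memsql/host.example.com@EXAMPLE.COM",
    "kerberos.keytab": "/path/to/kerberos.keytab",
    "hadoop.rpc.protection": "authentication,privacy",
    "dfs.client.use.datanode.hostname": true,
    "dfs.datanode.kerberos.principal": "datanode_principal/_HOST@EXAMPLE.COM",
    "dfs.namenode.kerberos.principal": "namenode_principal/_HOST@EXAMPLE.COM",
    "dfs.encrypt.data.transfer": true,
    "dfs.encrypt.data.transfer.cipher.key.bitlength": 256,
    "dfs.encrypt.data.transfer.algorithm": "rc4",
    "dfs.data.transfer.protection": "authentication"
}'
INTO TABLE `my_table`
FIELDS TERMINATED BY '\t';
```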

***

Modified at: February 25, 2025

Source: [/db/v9.1/load-data/data-sources/load-data-from-hdfs-using-a-pipeline/enabling-wire-encryption-and-kerberos-on-hdfs-pipelines/](https://docs.singlestore.com/db/v9.1/load-data/data-sources/load-data-from-hdfs-using-a-pipeline/enabling-wire-encryption-and-kerberos-on-hdfs-pipelines/)

