LOAD DATA

On this page

Important

SingleStore only supports LOAD DATA with the LOCAL option. As this SQL command will not work in the SQL Editor, LOAD DATA LOCAL must be run from a SQL client running on a computer that can access your SingleStore instance, such as the MySQL client (mysql-client) or SingleStore client. Refer to Connect to Your Cluster for more information on SQL clients.

Important

SingleStore clusters can be integrated with many third-party ETL and CDC tools.

Import data stored in a CSV, JSON, or Avro file into a SingleStore table (referred to as the destination table in this topic).

Remarks

The syntax and semantics of loading data from a CSV, JSON, or Avro file are detailed below.

IGNORE, SKIP PARSER ERRORS, and SKIP ALL ERRORS are unsupported with non-CSV pipelines.

REPLACE, SKIP CONSTRAINT ERRORS, and SKIP DUPLICATE KEY ERRORS are supported with non-CSV pipelines.

During the import of data stored in any of these files, you can optionally apply operations to the data as follows:

  • Use the WHERE clause to do filtering on incoming data. Only rows that satisfy the expression in the WHERE clause will be loaded into SingleStore. For an example of how to use the WHERE clause, see the examples section.

  • Use the SET clause to set columns using specific values or expressions with variables. For example, if your input file has 9 columns but the table has a 10th column called foo, you can add SET foo=0 or SET foo=@myVariable. Note that column names may only be used on the left side of SET expressions.

  • Use the CHARACTER SET clause to import files with any supported character set into SingleStore.

    For more information, see Character Encoding.

Refer to the Permission Matrix for the required permission.

Important

If a query uses @ in a LOAD DATA statement, SingleStore interprets it as a reference to a LOAD DATA assignment to a variable, not as a reference to a user-defined variable.

The behavior of SingleStore’s LOAD DATA command has several functional differences from MySQL’s command:

  • LOAD DATA will load the data into SingleStore in parallel to maximize performance. This makes LOAD DATA in SingleStore much faster on machines with a larger number of processors.

  • LOAD DATA supports loading compressed .gz files.

  • The only supported charset_name is utf8.

  • While it is possible to execute LOAD DATA on leaf nodes, it should only be run on master aggregator or child aggregator node types. See Node Requirements for SingleStore Commands for more information. Note that when running this command on reference tables you must connect to the master aggregator.

The mysqlimport utility can also be used to import data into SingleStore. mysqlimport uses LOAD DATA internally.

SingleStore stores information about errors encountered during each LOAD DATA operation, but the number of errors is limited to 1000 by default. When this limit is reached, the load fails. This prevents out-of-memory issues when unintentionally loading large files with incorrect format or an incorrect LOAD DATA statement. Use MAX_ERRORS at the end of the statement to change this limit. To specify no limit, set MAX_ERRORS to 0.

Writing to multiple databases in a transaction is not supported.

CSV LOAD DATA

Syntax

LOAD DATA [LOCAL] INFILE '<file_name>'
[REPLACE | IGNORE | SKIP { ALL | CONSTRAINT | DUPLICATE KEY | PARSER } ERRORS]
INTO TABLE <table_name>
[CHARACTER SET <character_set_name>]
[{FIELDS | COLUMNS}
[TERMINATED BY '<string>']
[[OPTIONALLY] ENCLOSED BY '<char>']
[ESCAPED BY '<char>']
]
[LINES
[STARTING BY '<string>']
[TERMINATED BY '<string>']
]
[TRAILING NULLCOLS]
[NULL DEFINED BY <string> [OPTIONALLY ENCLOSED]]
[IGNORE <number> LINES]
[ ({<column_name> | @<variable_name>}, ...) ]
[SET <column_name> = <expression>,...]
[WHERE <expression>,...]
[MAX_ERRORS <number>]
[ERRORS HANDLE <string>]

Remarks

  • Error Logging and Error Handling are discussed at the end of this topic.

  • To specify the compression type of an input file, use the COMPRESSION clause. See Handling Data Compression for more information.

  • If a CSV file appears to have the incorrect number of fields in any line, you can use the SKIP PARSER ERRORS option to skip the line. LOAD DATA reports a warning for every line that is skipped.

    Important

    Lines in a CSV file may appear to have the wrong number of fields if the FIELDS TERMINATED BY, FIELDS ENCLOSED BY, or ESCAPED BY clauses are incorrectly configured. If LOAD DATA incorrectly finds the start of the next line in a CSV after a parser error, it may parse all the subsequent lines incorrectly. For these reasons, investigate the CSV input and configuration settings mentioned above before using SKIP PARSER ERRORS.

  • The SKIP ALL ERRORS option is inclusive of the SKIP PARSER ERRORS, SKIP DUPLICATE KEY ERRORS and SKIP CONSTRAINT ERRORS options, i.e., specifying the SKIP ALL ERRORS option in a LOAD DATA query applies the behavior of the other three options.

  • The TERMINATED BY clause allows you to define field, column, and line delimiters so that the input data is interpreted and read correctly. For example, use FIELDS TERMINATED BYclause to load a CSV file where the fields are delimited by commas. Additionally, use the LINES TERMINATED BY '\r\n' clause if the lines in the CSV file are terminated by carriage return/newline pairs.

  • The ENCLOSED BY or equivalent OPTIONALLY ENCLOSED BY clause allows you to specify a string that encloses the field values. For example, use the ENCLOSED BY '"' clause to load a CSV file where the fields are enclosed within double quotation. Note that LOAD DATA will still load a field value even if it is not enclosed.

  • The ESCAPED BY clause allows you to specify the escape character. For example, if the input data contains special character(s), you may need to escape those characters to avoid misinterpretation. Also, you may need to redefine the default escape character to load a data set that contains the said character.

  • Many characters can be an escape. If the FIELDS ESCAPED BY clause is empty, the character escape sequence will do nothing.

  • The STARTING BY clause allows you to load only those lines of data that include a specified string (or prefix). While loading data, the STARTING BY clause skips the specified prefix and anything before it. It also skips the lines that do not contain the specified prefix.

    If no FIELDS or LINES clause is specified, then SingleStore uses the following defaults:

    FIELDS TERMINATED BY '\t'
    ENCLOSED BY ''
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
    STARTING BY ''
  • The TRAILING NULLCOLS clause allows the input file to contain rows having fewer than the number of columns in the table. These missing fields must be trailing columns in the row; they are inserted as NULL values in the table. See Using the TRAILING NULLCOLS Clause.

  • The NULL DEFINED BY clause inserts NULL field values in the table for fields in the input file having the value string_to_insert_as_null. The OPTIONALLY ENCLOSED option ensures that a quoted field is also treated as NULL, not an empty string. Refer to Using the NULL DEFINED BY Clause for more information.

    Note: If the string value 'NULL' is passed to a number-type column (for example, DECIMAL), it is parsed as a string and converted to 0. To insert NULL values instead, use the NULL DEFINED BY 'NULL' OPTIONALLY ENCLOSED clause. You can use the ENCLOSED BY clause in conjunction to specify the string that encloses the NULL values.

  • The IGNORE <number> LINES clause ignores the specified lines from the beginning of the input file. For example, use IGNORE 1 LINES to skip the header line that contains the column names.

Examples

Loading Data when the Order of the Columns in the Destination Table and Source File are Different

If the order of columns in the table is different from the order in the source file, you can name them explicitly. In this example, the columns are loaded in reverse order:

LOAD DATA INFILE 'foo.tsv'
INTO TABLE foo (fourth, third, second, first);

Skipping Columns in the Source File

You can skip columns in the source file using the @ sign. In this example only the first and fourth columns are imported into table foo:

LOAD DATA INFILE 'foo.tsv'
INTO TABLE foo (bar, @, @, baz);

Specifying the Column Delimiter

The default column delimiter is the tab (t) character, ASCII code 09. You can specify a different delimiter, even multi-character delimiters, with the COLUMNS TERMINATED BY clause:

LOAD DATA INFILE 'foo.csv'
INTO TABLE foo
COLUMNS TERMINATED BY ',';

In the following example, field and line delimiters are used to read a file that contains fields separated by commas and lines terminated by carriage return/newline pairs:

LOAD DATA INFILE 'foo.csv' INTO TABLE foo FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';

Source File with Unusual Column Separators

The following example demonstrates loading a file that has unusual column separators (|||):

LOAD DATA INFILE 'foo.oddformat'
INTO TABLE foo
COLUMNS TERMINATED BY '|||';

Loading Data from Multiple Files

Using globbing, you can load data from multiple files in a single LOAD DATA query.

The following query loads data from all the .csv files with names starting with a digit:

LOAD DATA INFILE "[0-9]*.csv"
INTO TABLE cust(ID,NAME,ORDERS);

The following query loads data from all the .csv files with filenames having four characters:

LOAD DATA INFILE "????.csv"
INTO TABLE cust(ID,NAME,ORDERS);

The following query loads data from all the .csv files with filenames not starting with a number:

LOAD DATA INFILE "[!0-9]*.csv"
INTO TABLE cust(ID,NAME,ORDERS);

Note

LOAD DATA LOCAL INFILE does not support globbing.

LOAD DATA INFILE supports globbing in filenames, but not in directory names.

CREATE PIPELINE contains a LOAD DATA clause. Here, LOAD DATA supports globbing, both in directory names and filenames.

Using the TRAILING NULLCOLS Clause

The following example demonstrates how to use the TRAILING NULLCOLS clause using the file numbers.csv , with the following content:

1,2,3
4,5
6

Run the following commands:

CREATE TABLE foo(a INT, b INT, c INT);
LOAD DATA INFILE 'numbers.csv' INTO TABLE foo COLUMNS TERMINATED BY ',' TRAILING NULLCOLS;
SELECT * FROM foo;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 |    2 |    3 |
|    4 |    5 | NULL |
|    6 | NULL | NULL |
+------+------+------+

Using the NULL DEFINED BY Clause

The following example demonstrates how to use the NULL DEFINED BY clause using the data.csv file.

cat data.csv
DTB,'',25
SPD,,40
SELECT * FROM stockN;
+------+-------------+-------+
| ID   | City        | Count |
+------+-------------+-------+
| XCN  | new york    |    45 |
| ZDF  | washington  |    20 |
| XCN  | chicago     |    32 |
+------+-------------+-------+

The following query inserts the un-enclosed empty field as a NULL value and the enclosed empty field as an empty string.

LOAD DATA INFILE '/data.csv'
INTO TABLE stockN
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY "'"
NULL DEFINED BY '';
SELECT * FROM stockN;
+------+-------------+-------+
| ID   | City        | Count |
+------+-------------+-------+
| XCN  | new york    |    45 |
| ZDF  | washington  |    20 |
| XCN  | chicago     |    32 |
| DTB  |             |    25 |
| SPD  | NULL        |    40 |
+------+-------------+-------+

If you add the OPTIONALLY ENCLOSED option to the NULL DEFINED BY clause in the query above, and run the following query instead, both the empty fields are inserted as a NULL value:

LOAD DATA INFILE '/data.csv'
INTO TABLE stockN
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY "'"
NULL DEFINED BY '' OPTIONALLY ENCLOSED;
SELECT * FROM stockN;
+------+-------------+-------+
| ID   | City        | Count |
+------+-------------+-------+
| XCN  | new york    |    45 |
| ZDF  | washington  |    20 |
| XCN  | chicago     |    32 |
| DTB  | NULL        |    25 |
| SPD  | NULL        |    40 |
+------+-------------+-------+

Using the IGNORE LINES Clause

In the following example, the IGNORE LINES clause is used to skip the header line that contains column names in the source file:

LOAD DATA INFILE '/tmp/data.txt' INTO City IGNORE 1 LINES;

Using the ESCAPED BY Clause

The following example demonstrates how to load data into the loadEsc table using the ESCAPED BY clause from the file contacts.csv, whose contents are shown below.

GALE\, ADAM, Brooklyn
FLETCHER\, RON, New York
WAKEFIELD\, CLARA, DC
DESC loadEsc;
+-------+-------------+------+------+---------+-------+
| Field | Type        | Null | Key  | Default | Extra |
+-------+-------------+------+------+---------+-------+
| Name  | varchar(40) | YES  |      | NULL    |       |
| City  | varchar(40) | YES  |      | NULL    |       |
+-------+-------------+------+------+---------+-------+

Execute the following query:

LOAD DATA INFILE '/contacts.csv'
INTO TABLE loadEsc COLUMNS TERMINATED BY ',' ESCAPED BY '\\' ;
SELECT * FROM loadEsc;
+-------------------+-----------+
| Name              | City      |
+-------------------+-----------+
| GALE, ADAM        |  Brooklyn |
| FLETCHER, RON     |  New York |
| WAKEFIELD, CLARA  |  DC       |
+-------------------+-----------+

In this query, the \ character escapes the comma (,) between the first two fields of the contacts.csv file. (The \ (backslash) is the default escape character in a SQL query. Hence, the \\ (double backslash) is used escape the backslash itself inside the query.)

Warning

If you (accidentally) escape the TERMINATED BY character in a file, the SQL query may return an error. For example, if you escape both the commas in any row of the contacts.csv file mentioned above, as:

GALE\, ADAM\, Brooklyn
FLETCHER\, RON, New York
WAKEFIELD\, CLARA, DC

and then execute the following query

LOAD DATA INFILE '/contacts.csv'
INTO TABLE loadEsc COLUMNS TERMINATED BY ',' ESCAPED BY '\\' ;

it returns the following error: ERROR 1261 (01000): Row 1 does not contain data for all columns. Because, the \ (backslash) escapes both the commas and LOAD DATA perceives the first row as a single column.

Using the STARTING BY Clause

The following example demonstrates how to skip the prefix ### in the stockUpd.txt data file using the STARTING BY clause.

cat stockUpd.txt
###1,"xcg",
3,"dfg"
new product###4,"rfk",5
LOAD DATA INFILE 'stockUpd.txt'
INTO TABLE stock
FIELDS TERMINATED BY ','
LINES STARTING BY '###';
SELECT * FROM stock;
+----+------+----------+
| ID | Code | Quantity |
+----+------+----------+
|  1 |  xcg |       10 |
|  4 |  rfk |        5 |
+----+------+----------+

In this example, the STARTING BY clause skips the prefix ### in the first and third lines and anything before it. It skips the second line, because it does not contain ###.

Filtering out Rows from the Source File

You can also filter out unwanted rows using the WHERE clause. In this example, only rows where bar is equal to 5 will be loaded. All other rows will be discarded:

LOAD DATA INFILE 'foo.oddformat'
INTO TABLE foo (bar, baz)
WHERE bar = 5;

Filtering out and Transforming Rows From the Source File

Complex transformations can be performed in both the SET and WHERE clauses. For example, if you have an input file with a EventDate field and an EventId field:

10-1-2016,1
4-15-2016,2
1-10-2017,3
4-10-2017,4

You want to only load the rows with a date that is within three months from a certain date, 10/15/2016, for instance. This can be accomplished by the following:

CREATE TABLE foo (EventDate date, EventId int);
LOAD DATA INFILE 'date_event.csv'
INTO TABLE foo
FIELDS TERMINATED BY ','
(@EventDate, EventId)
SET EventDate = STR_TO_DATE(@EventDate, '%m-%d-%Y')
WHERE ABS(MONTHS_BETWEEN(EventDate, date('2016-10-15'))) < 3;
SELECT * FROM t;
+------------+---------+
| EventDate  | EventId |
+------------+---------+
| 2016-10-01 |       1 |
| 2017-01-10 |       3 |
+------------+---------+

While both column names and variables can be referenced in the WHERE clause column names can only be assigned to in the SET clause. The scope of these clauses is restricted to the current row and therefore SELECT statements cannot be evaluated.

Using REPLACE

This example uses the cust table, which is defined as a columnstore table as follows:

CREATE TABLE cust(name VARCHAR(32), id INT(11), orders INT(11), SORT KEY(id), UNIQUE KEY(id) USING HASH, SHARD KEY(id));

Assume the directory /order_files has one file orders.csv, which contains the following data:

Chris,7214,6
Elen,8301,4
Adam,3412,5
Rachel,9125,2
Susan,8301,7
George,3412,9

Create a LOAD DATA statement with a REPLACE clause:

LOAD DATA INFILE '/order_files/orders.csv' REPLACE INTO TABLE cust FIELDS TERMINATED BY ',';

As LOAD DATA ingests the data from orders.csv into the cust table, it encounters the fifth and sixth records in the file, which contain the duplicate keys 8301 and 3412. The second and third records containing those duplicate keys (which have already been imported into cust), are replaced with the fifth and second records.

SELECT * FROM cust ORDER BY name;
+--------+------+--------+
| name   | id   | orders |
+--------+------+--------+
| Chris  | 7214 |      6 |
| George | 3412 |      9 |
| Rachel | 9125 |      2 |
| Susan  | 8301 |      7 |
+--------+------+--------+

Loading a Fixed Length File

This example demonstrates how to load the contents of the file fixed_length.csv, whose contents are shown below.

APE602020-06-01
TR 252019-08-07
HSW8 2019-10-11
YTR122020-09-02

LOAD DATA inserts each extracted row from fixed_length.csv into the table foo. Define the table as follows:

CREATE TABLE foo(a CHAR(3), b INT, c DATETIME);

Run the LOAD DATA statement:

LOAD DATA INFILE '/fixed_length.csv'
INTO TABLE foo (@current_row)
SET a = TRIM(SUBSTR(@current_row,1,3)),
b = TRIM(SUBSTR(@current_row,4,2)),
c = TRIM(SUBSTR(@current_row,6,10));

SUBSTR() extracts a substring from a string and TRIM() removes the padding (spaces in this case) from the beginning and the ending of a string. For example, after the LOAD DATA statement extracts the line HSW8 2019-10-11 in fixed_length.csv, it does the following to set b: * It extracts, from HSW8 2019-10-11, the substring starting at position 4 having a length of 2. The resulting substring is 8. * It removes the leading whitespace from 8 to yield 8.

Retrieve the data from foo:

SELECT * from foo ORDER BY a;
+------+------+---------------------+
| a    | b    | c                   |
+------+------+---------------------+
| APE  |   60 | 2020-06-01 00:00:00 |
| HSW  |    8 | 2019-10-11 00:00:00 |
| TR   |   25 | 2019-08-07 00:00:00 |
| YTR  |   12 | 2020-09-02 00:00:00 |
+------+------+---------------------+

Loading Data using Hex Field Terminator Syntax

Loading data into a table can be performed using a hexadecimal field terminator.

Syntax

CREATE TABLE <table name>(a int, b int);
LOAD DATA infile "/tmp/<file name>.csv" 
INTO TABLE test (a, b) fields terminated by 0x2c;
SELECT * FROM <table name>;
**
+------+------+
|  a   | b    |
+------+------+
|    1 |    2 |
+------+------+

Contents of the  hex_test.csv was created via the CLI using the following commands:

echo 1. 2 > hex_test.csv
hexdump -C hex_test.csv
00000000  31 2e 20 32 0a                            |1. 2.|
00000005

JSON LOAD DATA

Syntax

LOAD DATA [LOCAL] INFILE 'file_name'
[REPLACE | SKIP { CONSTRAINT | DUPLICATE KEY } ERRORS]
INTO TABLE tbl_name
FORMAT JSON
subvalue_mapping
[SET col_name = expr,...]
[WHERE expr,...]
[MAX_ERRORS number]
[ERRORS HANDLE string]
subvalue_mapping:
( {col_name | @variable_name} <- subvalue_path [DEFAULT literal_expr], ...)
subvalue_path:
{% | [%::]ident [::ident ...]}

Semantics

Error Logging and Error Handling are discussed at the end of this topic.

Extract specified subvalues from each JSON value in file_name. Assign them to specified columns of a new row in tbl_name, or to variables used for a column assignment in a SET clause. If a specified subvalue can’t be found in an input JSON, assign the DEFAULT clause literal instead. Discard rows which don’t match the WHERE clause.

To specify the compression type of an input file, use the COMPRESSION clause. See Handling Data Compression for more information.

The file named by file_name must consist of concatenated UTF-8 encoded JSON values, optionally separated by whitespace. Newline-delimited JSON is accepted, for example.

Non-standard JSON values like NaN, Infinity, and -Infinity must not occur in file_name.

If file_name ends in .gz or .lz4, it will be decompressed.

JSON LOAD DATA supports a subset of the error recovery options allowed by CSV LOAD DATA. Their behavior is as described under CSV LOAD DATA.

Like CSV LOAD DATA, JSON LOAD DATA allows you to use globbing to load data from multiple files. See the CSV Examples

Writing to multiple databases in a transaction is not supported.

Extracting JSON Values

subvalue_mapping specifies which subvalues are extracted and the column or variable to which each one is assigned.

LOAD DATA uses the ::-separated list of keys in a subvalue_path to perform successive key lookups in nested JSON objects, as if applying the :: SQL operator. Unlike the :: operator, subvalue_path may not be used to extract an element of a JSON array. The path % refers to the entire JSON value being processed. Leading %:: may be omitted from paths which are otherwise non-empty.

If a path can’t be found in an input JSON value, then if the containing element of subvalue_mapping has a DEFAULT clause, its literal_expr will be assigned; otherwise, LOAD DATA will terminate with an error.

Path components containing whitespace or punctuation must be surrounded by backticks. For example, the paths %::`a.a`::b and `a.a`::b will both extract 1 from the input object {"a.a":{"b":1},"c":2}.

Array elements may be indirectly extracted by applying JSON_EXTRACT_<type> in a SET clause.

Converting JSON Values

Before assignment or set clause evaluation, the JSON value extracted according to a subvalue_path is converted to a binary collation SQL string whose value depends on the extracted JSON type as follows:

JSON Type

Converted Value

null

SQL NULL

true/false

"1"/"0"

number

Verbatim, from extracted string.

string

All JSON string escape sequences, including escape sequences are converted to UTF-8. Verbatim otherwise.

array

Verbatim, from extracted string. For example, '[1,2]'

object

Verbatim, from extracted string. For example, '{"k":true}'

Conversion is not recursive. So, for example, true is not converted to "1" when it is a subvalue of an object which is being extracted whole.

JSON LOAD DATA Examples

To use an ENCLOSED BY <char> as a terminating field, a TERMINATED BY clause is needed. For clarity, instances of an ENCLOSED BY <char> appearing within a field value can be duplicated, and they will be understood as a singular occurrence of the character.

If an ENCLOSED BY "" is used, quotes are treated as follows:

  • "The ""NEW"" employee"  → The "NEW" employee

  • The "NEW" employee → The "NEW" employee

  • The ""NEW"" employee → The ""NEW"" employee

Example 1

If example.json consists of:

{"a":{"b":1}, "c":null}
{"a":{"b":2}, "d":null}

Then it can be loaded as follows:

CREATE TABLE t(a INT);
LOAD DATA LOCAL INFILE "example.json" INTO TABLE t(a <- a::b) FORMAT JSON;
SELECT * FROM t;
+------+
| a    |
+------+
|    1 |
|    2 |
+------+

Example 2

If example2.json consists of:

{"b":true, "s":"A\u00AE\u0022A", "n":-1.4820790816978637e-25, "a":[1,2], "o":{"subobject":1}}
{"b":false}
"hello"

Then we can perform a more complicated LOAD DATA:

CREATE TABLE t(b bool NOT NULL, s TEXT, n DOUBLE, a INT, o JSON NOT NULL, whole longblob);
LOAD DATA LOCAL INFILE "example2.json" INTO TABLE t FORMAT JSON(
b <- b default true,
s <- s default NULL,
n <- n default NULL,
@avar <- a default NULL,
o <- o default '{"subobject":"replaced"}',
whole <- %)
SET a = json_extract_double(@avar, 1)
WHERE b = true;
SELECT * FROM t;
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
| b | s     | n                       | a    | o                        | whole                                                                                         |
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
| 1 | A®"A  | -1.4820790816978637e-25 |    2 | {"subobject":1}          | {"b":true, "s":"A\u00AE\u0022A", "n":-1.4820790816978637e-25, "a":[1,2], "o":{"subobject":1}} |
| 1 | NULL  |                    NULL | NULL | {"subobject":"replaced"} | hello                                                                                         |
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+

There are several things to note in the example above:

  • true was converted to "1" for columns b, but not for column whole. "1" was further converted to the BOOL value 1.

  • The escapes"\u00AE" and "\u0022" were converted to UTF-8 for column s, but not for column whole. Note that whole would have become invalid JSON if we had translated "\u0022".

  • The second row was discarded because it failed to match the WHERE clause.

  • None of the paths in subvalue_mapping could be found in the third row, so DEFAULT literals like '{"subobject":"replaced"}' were assigned instead.

  • We assigned a to an intermediate variable so that we could extract an array element in the SET clause.

  • The top-level JSON values in example2.json were not all JSON objects. "hello" is a valid top-level JSON value.

Loading JSON Data from a CSV File

To use an ENCLOSED BY <char> as a terminating field, a TERMINATED BY clause is needed. For clarity, instances of an ENCLOSED BY <char> appearing within a field value can be duplicated, and they will be understood as a singular occurrence of the character.

If an ENCLOSED BY "" is used, the quotes are treated as follows:

  • "The ""New"" employee"  → The "NEW" employee

  • The "New" employee → The "NEW" employee

  • The ""NEW"" employee → The ""NEW"" employee

Example 1

An ENCLOSED BY clause is required when a csv file has a JSON column enclosed with double quotation marks (" ").

CREATE TABLE employees(emp_id int, data JSON);
csv file contents
emp_id,data
159,"{""name"": ""Damien Karras"", ""age"": 38, ""city"": ""New York""}"
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employees
    FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
    IGNORE 1 LINES;
SELECT * FROM employees;
+--------+-----------------------------------------------------+
| emp_id | data                                                |
+--------+-----------------------------------------------------+
|    159 | {"age":38,"city":"New York","name":"Damien Karras"} |
+--------+-----------------------------------------------------+

Example 2

An ESCAPED BY clause is required when a character is specified as an escape character for a string. The example below uses a backslash (\) as the escape character.

csv file contents
emp_id,data
298,"{\"name\": \"Bill Denbrough\", \"age\": 25, \"city\": \"Bangor\"}"
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employees
    INTO TABLE employees
    FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
    IGNORE 1 LINES;
SELECT * FROM employees;
+--------+-----------------------------------------------------+
| emp_id | data                                                |
+--------+-----------------------------------------------------+
|    298 | {"age":25,"city":"Bangor","name":"Bill Denbrough"}  |
|    159 | {"age":38,"city":"New York","name":"Damien Karras"} |
+--------+-----------------------------------------------------+

Example 3

This example will fail as the JSON field in the csv file is not in the correct format.

csv file contents
emp_id,data
410,"{"name": "Annie Wilkes", "age": 45, "city":"Silver Creek"}"
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employees
    FIELDS TERMINATED BY ','
    ENCLOSED BY '{'
    IGNORE 1 LINES;
ERROR 1262 (01000): Leaf Error (127.0.0.1:3307): Row 1 was truncated; it contained more data than there were input columns

Example 4

An ENCLOSED BY clause is required when a csv file has a JSON column enclosed with curly brackets ({ }).

csv file contents
emp_id,data
089,{"name": "Wilbur Whateley","age": 62,"city": "Dunwich"}
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employees
    FIELDS TERMINATED BY ','
    ENCLOSED BY '{'
    IGNORE 1 LINES;
SELECT * FROM employees;
+--------+------------------------------------------------------+
| emp_id | data                                                 |
+--------+------------------------------------------------------+
|    298 | {"age":25,"city":"Bangor","name":"Bill Denbrough"}   |
|    159 | {"age":38,"city":"New York","name":"Damien Karras"}  |
|     89 | {"age":62,"city":"Dunwich","name":"Wilbur Whateley"} |
+--------+------------------------------------------------------+

Avro LOAD DATA

Syntax for LOAD DATA Local Infile

LOAD DATA [LOCAL] INFILE 'file_name'
WHERE/SET/SKIP ERRORS[REPLACE | SKIP { CONSTRAINT | DUPLICATE KEY } ERRORS]
INTO TABLE tbl_name
FORMAT AVRO SCHEMA REGISTRY {"IP" | "Hostname"}
subvalue_mapping
[SET col_name = expr,...]
[WHERE expr,...]
[MAX_ERRORS number]
[ERRORS HANDLE string]
[SCHEMA 'avro_schema']
subvalue_mapping:
( {col_name | @variable_name} <- subvalue_path, ...)
subvalue_path:
{% | [%::]ident [::ident ...]}

See the associated GitHub repo.

Semantics

Error Logging and Error Handling are discussed at the end of this topic.

LOAD DATA for Avro does not support file name globbing (for example: LOAD DATA INFILE '/data/nfs/gp1/*.avro). LOAD DATA for Avro only supports loading a single file per statement.

Extract specified subvalues from each Avro value in file_name. Assign them to specified columns of a new row in tbl_name, or to variables used for a column assignment in a SET clause. Discard rows which don’t match the WHERE clause.

To specify the compression type of an input file, use the COMPRESSION clause. See Handling Data Compression for more information.

Avro LOAD DATA expects Avro data in one of two sub-formats, depending on the SCHEMA clause.

If no SCHEMA clause is provided, file_name must name an Avro Object Container File as described in version 1.8.2 of the Avro specification. In addition, the following restrictions hold:

  • The compression codec of the file must be null.

  • Array and map values must not have more than 16384 elements.

  • The type name of a record must not be used in a symbolic reference to previously defined name in any of its fields. It may still be used in a symbolic reference outside the record definition, however.

    For example, self-referential schemas like the following are rejected by LOAD DATA:

    {
    "type": "record",
    "name": "PseudoLinkedList",
    "fields" : [{"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "PseudoLinkedList"]}]
    }

If a SCHEMA clause is provided, the file must be a raw stream consisting of only the concatenated binary encodings of instances of avro_schema. avro_schema must be a SQL string containing a JSON Avro schema. The restrictions on Object Container Files also apply to raw stream files.

Warning

It’s an error to provide a SCHEMA clause when loading an Object Container File because it contains metadata alongside the encoded values.

All optional Avro schema attributes except the namespace attribute are ignored. Notably, logicalType attributes are ignored.

If file_name ends in .gz or .lz4, it will be decompressed.

Avro LOAD DATA supports a subset of the error recovery options allowed by CSV LOAD DATA. Their behavior is as described under CSV LOAD DATA.

Writing to multiple databases in a transaction is not supported.

The SCHEMA REGISTRY {"IP" | "Hostname"} option allows LOAD DATA to pull the schema from a schema registry. For more information, see the Avro Schema Evolution With Pipelines topic.

Extracting Avro Values

subvalue_mapping specifies which subvalues are extracted and the column or variable to which each one is assigned.

LOAD DATA uses the ::-separated list of names in a subvalue_path to perform successive field name or union branch type name lookups in nested Avro records or unions. subvalue_path may not be used to extract elements of Avro arrays or maps. The path % refers to the entire Avro value being processed. Leading %:: may be omitted from paths which are otherwise non-empty.

If a path can’t be found in an input Avro value, then: * If a prefix of the path matches a record whose schema has no field matching the next name in the path, then LOAD DATA will terminate with an error. * If a prefix matches a union whose schema has no branch matching the next name, then LOAD DATA will terminate with an error. * If a prefix matches a union whose schema has a branch matching the next name, but that branch isn’t the selected branch in that instance of the union schema, then Avro null will be extracted instead and LOAD DATA will continue.

Path components naming union branches must use the two-part fullname of the branch’s type if that type is in a namespace.

Path components containing whitespace or punctuation must be surrounded by backticks.

Array and map elements may be indirectly extracted by applying JSON_EXTRACT_<type> in a SET clause.

For example, consider two Avro records with the union schema:

[
"int",
{ "type" : "record",
"name" : "a",
"namespace" : "n",
"fields" : [{ "name" : "f1",
"type" : "int" }]
}
]

The paths %::`n.a`::f1 and `n.a`::f1 will both extract 1 from an instance of this schema whose JSON encoding is {"n.a":{"f1":1}}.

They will extract null from an instance whose encoding is {"int":2}.

The paths %::int and int will extract 2 from the second instance and null from the first.

Converting Avro Values

Before assignment or set clause evaluation, the Avro value extracted according to a subvalue_path is converted to an unspecified SQL type which may be further explicitly or implicitly converted as if from a SQL string whose value is as follows:

Avro Type

Converted Value

null

SQL NULL

boolean

"1"/"0"

int

The string representation of the value

long

The string representation of the value

float

SQL NULL if not finite. Otherwise, a string convertible without loss of precision to FLOAT

double

SQL NULL if not finite. Otherwise, a string convertible without loss of precision to DOUBLE

enum

The string representation of the enum.

bytes

Verbatim, from input bytes

string

Verbatim, from input bytes

fixed

Verbatim, from input bytes

record

The JSON encoding of the value

map

The JSON encoding of the value

array

The JSON encoding of the value

union

The JSON encoding of the value

logicalType attributes are ignored and have no effect on conversion.

Avro LOAD DATA Examples

Example 1

Consider an Avro Object Container Fileexample.avro with the following schema:

{
"type": "record",
"name": "data",
"fields": [{ "name": "id", "type": "long"},
{ "name": "payload", "type": [ "null",
"string" ]}]
}

example.avro contains three Avro values whose JSON encodings are:

{"id":1,"payload":{"string":"first"}}
{"id":1,"payload":{"string":"second"}}
{"id":1,"payload":null}

example.avro can be loaded as follows:

CREATE TABLE t(payload TEXT, input_record JSON);
LOAD DATA LOCAL INFILE "example.avro"
INTO TABLE t
FORMAT AVRO
( payload <- %::payload::string,
input_record <- % );
SELECT * FROM t;
+---------+----------------------------------------+
| payload | input_record                           |
+---------+----------------------------------------+
| first   | {"id":1,"payload":{"string":"first"}}  |
| second  | {"id":1,"payload":{"string":"second"}} |
| NULL    | {"id":1,"payload":null}                |
+---------+----------------------------------------+

LOAD DATA was able to parse example.avro because Avro Object Container Files have a header which contains their schema.

Example 2

Consider a file named example.raw_avro, with the same values as example.avro from Example 1 but in the raw stream format. That is, example.raw_avro consists of the binary encoded values and nothing else. We add a SCHEMA clause to tell LOAD DATA to expect a raw stream with the provided schema:

CREATE TABLE t(payload TEXT, input_record JSON);
LOAD DATA LOCAL INFILE "example.raw_avro"
INTO TABLE t
FORMAT AVRO
( payload <- %::payload::string,
input_record <- % )
schema
'{
"type": "record",
"name": "data",
"fields": [{ "name": "id", "type": "long"},
{ "name": "payload", "type": [ "null", "string" ]}]
}';
SELECT * FROM t;
+---------+----------------------------------------+
| payload | input_record                           |
+---------+----------------------------------------+
| first   | {"id":1,"payload":{"string":"first"}}  |
| second  | {"id":1,"payload":{"string":"second"}} |
| NULL    | {"id":1,"payload":null}                |
+---------+----------------------------------------+

Example 3

Consider an Object Container Fileexample3.avro with a more complicated payload than Example 1. We illustrate extracting values from nested unions and records, and also indirectly extracting elements of nested maps and arrays.

{ "type": "record",
"namespace": "ns",
"name": "data",
"fields": [
{ "name": "id", "type": "long" },
{ "name": "payload", "type":
[ "null",
{ "type": "record",
"name": "payload_record",
"namespace": "ns",
"fields": [
{ "name": "f_bytes", "type": "bytes"},
{ "name": "f_string", "type": "string"},
{ "name": "f_map", "type":
{ "type": "map",
"values": { "type": "array",
"items": "int" }}}
]
}
]
}
]
}

The raw JSON encoding of the contents of this file can be seen in column c_whole_raw after the following LOAD DATA:

CREATE TABLE t (
c_id bigint,
c_bytes longblob,
c_string longblob,
c_array_second int,
c_whole_raw longblob,
c_whole_json json
);
LOAD DATA INFILE "example3.avro"
INTO TABLE t
FORMAT AVRO
( c_id <- %::id,
c_bytes <- %::payload::`ns.payload_record`::f_bytes,
c_string <- %::payload::`ns.payload_record`::f_string,
@v_map <- %::payload::`ns.payload_record`::f_map,
c_whole_raw <- %,
c_whole_json <- %)
SET c_array_second = JSON_EXTRACT_JSON(@v_map, "a", 1);
SELECT * FROM t;
*** 1. row ***
          c_id: 1
       c_bytes: NULL
      c_string: NULL
c_array_second: NULL
   c_whole_raw: {"id":1,"payload":null}
  c_whole_json: {"id":1,"payload":null}
*** 2. row ***
          c_id: 2
       c_bytes: "A
      c_string: "A
c_array_second: 2
   c_whole_raw: {"id":2,"payload":{"ns.payload_record":{"f_bytes":"\u0022\u0041","f_string":"\"A","f_map":{"a":[1,2]}}}}
  c_whole_json: {"id":2,"payload":{"ns.payload_record":{"f_bytes":"\"A","f_map":{"a":[1,2]},"f_string":"\"A"}}}

There are several things to note:

  • We attempted to extract subvalues of the payload_record branch of the union-type payload field. Since that wasn’t the selected member of the union in record 1, LOAD DATA assigned NULL to c_bytes and @v_map.

  • We assigned the JSON encoding of f_map to @v_map and then performed JSON map and array lookups in the SET clause to ultimately extract 2.

  • f_string and f_bytes had the same contents, but we can see how their different Avro types affected their JSON encodings and interacted with the SQL JSON type

    • The JSON encoding of the Avro string value f_string, as seen in c_whole_raw, encodes special characters like " as the escape sequence \".

    • The JSON encoding of the Avro bytes value f_bytes, as seen in c_whole_raw, encodes every byte with a JSON escape.

    • When converting the JSON encoding of record 2 to the SQL JSON type while assigning to c_whole_json, LOAD DATA normalized both representations of the byte sequence "A to \"A.

Loading Parquet Data

The LOAD DATA command does not support loading Parquet files. However, you can use the LOAD DATA clause in a CREATE PIPELINE .. FORMAT statement to create a pipeline that loads Parquet files.

Handling Data Compression

The COMPRESSION clause specifies how LOAD DATA handles the compression of an input file.

Syntax for LOAD DATA Local Infile

LOAD DATA INFILE 'filename' COMPRESSION { AUTO | NONE | LZ4 | GZIP } INTO TABLE ...
LOAD DATA INFILE 'filename' INTO TABLE `tablename` COMPRESSION { AUTO | NONE | LZ4 | GZIP } ...

Arguments

  • AUTO: This is the default setting, it tells LOAD DATA to identify the compression type from the input file’s extension.

  • NONE: Specifies that the input file is uncompressed.

  • LZ4: Specifies that the input file is compressed with LZ4 compression algorithm.

  • GZIP: Specifies that the input file is compressed with GZIP compression algorithm.

Remarks

  • If COMPRESSION is set to NONE, LZ4, or GZIP, LOAD DATA will not use the extension of the input file to determine the type of compression. For example, if you load a file test.gz and specify the COMPRESSION as NONE, then LOAD DATA will handle test.gz as an uncompressed file.

LOCAL

LOCAL affects the expected file location, the search behavior for relative path names, and Error Handling behavior.

When you specify LOCAL, the client reads file_name and sends it to the server. The expected location of the file is within the client directory. If file_name is a relative path, it is relative to the current working directory of the client.

When LOCAL is not specified, the file is read by the server, and needs to be located on the related server host. If a relative file path name is specified, it is searched for as relative to the server’s data directory, or as relative to the directory of the default database in cases where no leading components are given.

Because files need to be sent from the client to the server, specifying LOCAL can be slower. However, it does have the security advantage in that the client user must be able to read the file(s) being loaded. Where LOCAL is not specified, the server needs access to the full data directory, meaning that any user who has the permissions to LOAD DATA or CREATE PIPELINE can read the directory. This is because those permissions include FILE READ Permissions Matrix

LOCAL does not support globbing (such as using wildcards in directory or filenames).

An example of using LOCAL follows: