LOAD DATA
On this page
Important
SingleStore only supports LOAD DATA
with the LOCAL
option.LOAD DATA LOCAL
must be run from a SQL client running on a computer that can access your SingleStore instance, such as the MySQL client (mysql-client) or SingleStore client.
Important
SingleStore clusters can be integrated with many third-party ETL and CDC tools.
Import data stored in a CSV, JSON, or Avro file into a SingleStore table (referred to as the destination table in this topic).
Remarks
The syntax and semantics of loading data from a CSV, JSON, or Avro file are detailed below.
IGNORE
, SKIP PARSER ERRORS
, and SKIP ALL ERRORS
are unsupported with non-CSV pipelines.
REPLACE
, SKIP CONSTRAINT ERRORS
, and SKIP DUPLICATE KEY ERRORS
are supported with non-CSV pipelines.
During the import of data stored in any of these files, you can optionally apply operations to the data as follows:
-
Use the
WHERE
clause to do filtering on incoming data.Only rows that satisfy the expression in the WHERE
clause will be loaded into SingleStore.For an example of how to use the WHERE
clause, see the examples section. -
Use the
SET
clause to set columns using specific values or expressions with variables.For example, if your input file has 9 columns but the table has a 10th column called foo
, you can addSET foo=0
orSET foo=@myVariable
.Note that column names may only be used on the left side of SET
expressions. -
Use the
CHARACTER SET
clause to import files with any supported character set into SingleStore.For more information, see Character Encoding.
Refer to the Permission Matrix for the required permission.
Important
If a query uses @
in a LOAD DATA
statement, SingleStore interprets it as a reference to a LOAD DATA
assignment to a variable, not as a reference to a user-defined variable.
The behavior of SingleStore’s LOAD DATA
command has several functional differences from MySQL’s command:
-
LOAD DATA
will load the data into SingleStore in parallel to maximize performance.This makes LOAD DATA
in SingleStore much faster on machines with a larger number of processors. -
LOAD DATA
supports loading compressed .gz files. -
The only supported
charset_
isname utf8
. -
While it is possible to execute
LOAD DATA
on leaf nodes, it should only be run on master aggregator or child aggregator node types.See Cluster Management Commands for more information. Note that when running this command on reference tables you must connect to the master aggregator.
The mysqlimport
utility can also be used to import data into SingleStore.mysqlimport
uses LOAD DATA
internally.
SingleStore stores information about errors encountered during each LOAD DATA
operation, but the number of errors is limited to 1000 by default.LOAD DATA
statement.MAX_
at the end of the statement to change this limit.MAX_
to 0
.
Writing to multiple databases in a transaction is not supported.
CSV LOAD DATA
Syntax
LOAD DATA [LOCAL] INFILE '<file_name>'[REPLACE | IGNORE | SKIP { ALL | CONSTRAINT | DUPLICATE KEY | PARSER } ERRORS]INTO TABLE <table_name>[CHARACTER SET <character_set_name>][{FIELDS | COLUMNS}[TERMINATED BY '<string>'][[OPTIONALLY] ENCLOSED BY '<char>'][ESCAPED BY '<char>']][LINES[STARTING BY '<string>'][TERMINATED BY '<string>']][TRAILING NULLCOLS][NULL DEFINED BY <string> [OPTIONALLY ENCLOSED]][IGNORE <number> LINES][ ({<column_name> | @<variable_name>}, ...) ][SET <column_name> = <expression>,...][WHERE <expression>,...][MAX_ERRORS <number>][ERRORS HANDLE <string>]
Remarks
-
Error Logging and Error Handling are discussed at the end of this topic.
-
To specify the compression type of an input file, use the
COMPRESSION
clause.See Handling Data Compression for more information. -
If a CSV file appears to have the incorrect number of fields in any line, you can use the
SKIP PARSER ERRORS
option to skip the line.LOAD DATA
reports a warning for every line that is skipped.Important
Lines in a CSV file may appear to have the wrong number of fields if the
FIELDS TERMINATED BY
,FIELDS ENCLOSED BY
, orESCAPED BY
clauses are incorrectly configured.If LOAD DATA
incorrectly finds the start of the next line in a CSV after a parser error, it may parse all the subsequent lines incorrectly.For these reasons, investigate the CSV input and configuration settings mentioned above before using SKIP PARSER ERRORS
. -
The
SKIP ALL ERRORS
option is inclusive of theSKIP PARSER ERRORS
,SKIP DUPLICATE KEY ERRORS
andSKIP CONSTRAINT ERRORS
options, i.e. , specifying the SKIP ALL ERRORS
option in aLOAD DATA
query applies the behavior of the other three options. -
The
TERMINATED BY
clause allows you to define field, column, and line delimiters so that the input data is interpreted and read correctly.For example, use FIELDS TERMINATED BY
clause to load a CSV file where the fields are delimited by commas.Additionally, use the LINES TERMINATED BY '\r\n'
clause if the lines in the CSV file are terminated by carriage return/newline pairs. -
The
ENCLOSED BY
or equivalentOPTIONALLY ENCLOSED BY
clause allows you to specify a string that encloses the field values.For example, use the ENCLOSED BY '"'
clause to load a CSV file where the fields are enclosed within double quotation.Note that LOAD DATA
will still load a field value even if it is not enclosed. -
The
ESCAPED BY
clause allows you to specify the escape character.For example, if the input data contains special character(s), you may need to escape those characters to avoid misinterpretation. Also, you may need to redefine the default escape character to load a data set that contains the said character. -
Many characters can be an escape.
If the FIELDS ESCAPED BY
clause is empty, the character escape sequence will do nothing. -
The
STARTING BY
clause allows you to load only those lines of data that include a specified string (or prefix).While loading data, the STARTING BY
clause skips the specified prefix and anything before it.It also skips the lines that do not contain the specified prefix. If no
FIELDS
orLINES
clause is specified, then SingleStore uses the following defaults:FIELDS TERMINATED BY '\t'ENCLOSED BY ''ESCAPED BY '\\'LINES TERMINATED BY '\n'STARTING BY ''
-
The
TRAILING NULLCOLS
clause allows the input file to contain rows having fewer than the number of columns in the table.These missing fields must be trailing columns in the row; they are inserted as NULL
values in the table.See Using the TRAILING NULLCOLS
Clause.
-
The
NULL DEFINED BY
clause insertsNULL
field values in the table for fields in the input file having the valuestring_
.to_ insert_ as_ null The OPTIONALLY ENCLOSED
option ensures that a quoted field is also treated asNULL
, not an empty string.Refer to Using the NULL DEFINED BY
Clause for more information.Note: If the string value
'NULL'
is passed to a number-type column (for example, DECIMAL), it is parsed as a string and converted to 0.To insert NULL
values instead, use theNULL DEFINED BY 'NULL' OPTIONALLY ENCLOSED
clause.You can use the ENCLOSED BY
clause in conjunction to specify the string that encloses theNULL
values. -
The
IGNORE <number> LINES
clause ignores the specified lines from the beginning of the input file.For example, use IGNORE 1 LINES
to skip the header line that contains the column names.
Examples
Loading Data when the Order of the Columns in the Destination Table and Source File are Different
If the order of columns in the table is different from the order in the source file, you can name them explicitly.
LOAD DATA INFILE 'foo.tsv'INTO TABLE foo (fourth, third, second, first);
Skipping Columns in the Source File
You can skip columns in the source file using the @
sign.
LOAD DATA INFILE 'foo.tsv'INTO TABLE foo (bar, @, @, baz);
Specifying the Column Delimiter
The default column delimiter is the tab (t
) character, ASCII code 09.COLUMNS TERMINATED BY
clause:
LOAD DATA INFILE 'foo.csv'INTO TABLE fooCOLUMNS TERMINATED BY ',';
In the following example, field and line delimiters are used to read a file that contains fields separated by commas and lines terminated by carriage return/newline pairs:
LOAD DATA INFILE 'foo.csv' INTO TABLE foo FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
Source File with Unusual Column Separators
The following example demonstrates loading a file that has unusual column separators (|||
):
LOAD DATA INFILE 'foo.oddformat'INTO TABLE fooCOLUMNS TERMINATED BY '|||';
Loading Data from Multiple Files
Using globbing, you can load data from multiple files in a single LOAD DATA
query.
The following query loads data from all the .
files with names starting with a digit:
LOAD DATA INFILE "[0-9]*.csv"INTO TABLE cust(ID,NAME,ORDERS);
The following query loads data from all the .
files with filenames having four characters:
LOAD DATA INFILE "????.csv"INTO TABLE cust(ID,NAME,ORDERS);
The following query loads data from all the .
files with filenames not starting with a number:
LOAD DATA INFILE "[!0-9]*.csv"INTO TABLE cust(ID,NAME,ORDERS);
Note
LOAD DATA LOCAL INFILE
does not support globbing.
LOAD DATA INFILE
supports globbing in filenames, but not in directory names.
CREATE PIPELINE
contains a LOAD DATA
clause.LOAD DATA
supports globbing, both in directory names and filenames.
Using the TRAILING NULLCOLS
Clause
The following example demonstrates how to use the TRAILING NULLCOLS
clause using the file numbers.
, with the following content:
1,2,3
4,5
6
Run the following commands:
CREATE TABLE foo(a INT, b INT, c INT);LOAD DATA INFILE 'numbers.csv' INTO TABLE foo COLUMNS TERMINATED BY ',' TRAILING NULLCOLS;SELECT * FROM foo;
+------+------+------+
| a | b | c |
+------+------+------+
| 1 | 2 | 3 |
| 4 | 5 | NULL |
| 6 | NULL | NULL |
+------+------+------+
Using the NULL DEFINED BY
Clause
The following example demonstrates how to use the NULL DEFINED BY
clause using the data.
file.
cat data.csv
DTB,'',25
SPD,,40
SELECT * FROM stockN;
+------+-------------+-------+
| ID | City | Count |
+------+-------------+-------+
| XCN | new york | 45 |
| ZDF | washington | 20 |
| XCN | chicago | 32 |
+------+-------------+-------+
The following query inserts the un-enclosed empty field as a NULL
value and the enclosed empty field as an empty string.
LOAD DATA INFILE '/data.csv'INTO TABLE stockNCOLUMNS TERMINATED BY ','OPTIONALLY ENCLOSED BY "'"NULL DEFINED BY '';SELECT * FROM stockN;
+------+-------------+-------+
| ID | City | Count |
+------+-------------+-------+
| XCN | new york | 45 |
| ZDF | washington | 20 |
| XCN | chicago | 32 |
| DTB | | 25 |
| SPD | NULL | 40 |
+------+-------------+-------+
If you add the OPTIONALLY ENCLOSED
option to the NULL DEFINED BY
clause in the query above, and run the following query instead, both the empty fields are inserted as a NULL
value:
LOAD DATA INFILE '/data.csv'INTO TABLE stockNCOLUMNS TERMINATED BY ','OPTIONALLY ENCLOSED BY "'"NULL DEFINED BY '' OPTIONALLY ENCLOSED;SELECT * FROM stockN;
+------+-------------+-------+
| ID | City | Count |
+------+-------------+-------+
| XCN | new york | 45 |
| ZDF | washington | 20 |
| XCN | chicago | 32 |
| DTB | NULL | 25 |
| SPD | NULL | 40 |
+------+-------------+-------+
Using the IGNORE LINES
Clause
In the following example, the IGNORE LINES
clause is used to skip the header line that contains column names in the source file:
LOAD DATA INFILE '/tmp/data.txt' INTO City IGNORE 1 LINES;
Using the ESCAPED BY
Clause
The following example demonstrates how to load data into the loadEsc
table using the ESCAPED BY
clause from the file contacts.
, whose contents are shown below.
GALE\, ADAM, Brooklyn
FLETCHER\, RON, New York
WAKEFIELD\, CLARA, DC
DESC loadEsc;
+-------+-------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+------+---------+-------+
| Name | varchar(40) | YES | | NULL | |
| City | varchar(40) | YES | | NULL | |
+-------+-------------+------+------+---------+-------+
Execute the following query:
LOAD DATA INFILE '/contacts.csv'INTO TABLE loadEsc COLUMNS TERMINATED BY ',' ESCAPED BY '\\' ;SELECT * FROM loadEsc;
+-------------------+-----------+
| Name | City |
+-------------------+-----------+
| GALE, ADAM | Brooklyn |
| FLETCHER, RON | New York |
| WAKEFIELD, CLARA | DC |
+-------------------+-----------+
In this query, the \
character escapes the comma (,
) between the first two fields of the contacts.\
(backslash) is the default escape character in a SQL query.\\
(double backslash) is used escape the backslash itself inside the query.
Warning
If you (accidentally) escape the TERMINATED BY
character in a file, the SQL query may return an error.
GALE\, ADAM\, Brooklyn
FLETCHER\, RON, New York
WAKEFIELD\, CLARA, DC
and then execute the following query
LOAD DATA INFILE '/contacts.csv'INTO TABLE loadEsc COLUMNS TERMINATED BY ',' ESCAPED BY '\\' ;
it returns the following error: ERROR 1261 (01000): Row 1 does not contain data for all columns.\
(backslash) escapes both the commas and LOAD DATA
perceives the first row as a single column.
Using the STARTING BY
Clause
The following example demonstrates how to skip the prefix ###
in the stockUpd.
data file using the STARTING BY
clause.
cat stockUpd.txt
###1,"xcg",
3,"dfg"
new product###4,"rfk",5
LOAD DATA INFILE 'stockUpd.txt'INTO TABLE stockFIELDS TERMINATED BY ','LINES STARTING BY '###';SELECT * FROM stock;
+----+------+----------+
| ID | Code | Quantity |
+----+------+----------+
| 1 | xcg | 10 |
| 4 | rfk | 5 |
+----+------+----------+
In this example, the STARTING BY
clause skips the prefix ###
in the first and third lines and anything before it.###
.
Filtering out Rows from the Source File
You can also filter out unwanted rows using the WHERE
clause.
LOAD DATA INFILE 'foo.oddformat'INTO TABLE foo (bar, baz)WHERE bar = 5;
Filtering out and Transforming Rows From the Source File
Complex transformations can be performed in both the SET
and WHERE
clauses.EventDate
field and an EventId
field:
10-1-2016,1
4-15-2016,2
1-10-2017,3
4-10-2017,4
You want to only load the rows with a date that is within three months from a certain date, 10/15/2016, for instance.
CREATE TABLE foo (EventDate date, EventId int);LOAD DATA INFILE 'date_event.csv'INTO TABLE fooFIELDS TERMINATED BY ','(@EventDate, EventId)SET EventDate = STR_TO_DATE(@EventDate, '%m-%d-%Y')WHERE ABS(MONTHS_BETWEEN(EventDate, date('2016-10-15'))) < 3;SELECT * FROM t;
+------------+---------+
| EventDate | EventId |
+------------+---------+
| 2016-10-01 | 1 |
| 2017-01-10 | 3 |
+------------+---------+
While both column names and variables can be referenced in the WHERE
clause column names can only be assigned to in the SET
clause.SELECT
statements cannot be evaluated.
Using REPLACE
This example uses the cust
table, which is defined as a columnstore table as follows:
CREATE TABLE cust(name VARCHAR(32), id INT(11), orders INT(11), SORT KEY(id), UNIQUE KEY(id) USING HASH, SHARD KEY(id));
Assume the directory /order_
has one file orders.
, which contains the following data:
Chris,7214,6
Elen,8301,4
Adam,3412,5
Rachel,9125,2
Susan,8301,7
George,3412,9
Create a LOAD DATA
statement with a REPLACE
clause:
LOAD DATA INFILE '/order_files/orders.csv' REPLACE INTO TABLE cust FIELDS TERMINATED BY ',';
As LOAD DATA
ingests the data from orders.
into the cust
table, it encounters the fifth and sixth records in the file, which contain the duplicate keys 8301
and 3412
.cust
), are replaced with the fifth and second records.
SELECT * FROM cust ORDER BY name;
+--------+------+--------+
| name | id | orders |
+--------+------+--------+
| Chris | 7214 | 6 |
| George | 3412 | 9 |
| Rachel | 9125 | 2 |
| Susan | 8301 | 7 |
+--------+------+--------+
Note
If you want to see more examples of loading data with vectors, refer to How to Bulk Load Vectors.
Loading a Fixed Length File
This example demonstrates how to load the contents of the file fixed_
, whose contents are shown below.
APE602020-06-01
TR 252019-08-07
HSW8 2019-10-11
YTR122020-09-02
LOAD DATA
inserts each extracted row from fixed_
into the table foo
.
CREATE TABLE foo(a CHAR(3), b INT, c DATETIME);
Run the LOAD DATA
statement:
LOAD DATA INFILE '/fixed_length.csv'INTO TABLE foo (@current_row)SET a = TRIM(SUBSTR(@current_row,1,3)),b = TRIM(SUBSTR(@current_row,4,2)),c = TRIM(SUBSTR(@current_row,6,10));
SUBSTR()
extracts a substring from a string and TRIM()
removes the padding (spaces in this case) from the beginning and the ending of a string.LOAD DATA
statement extracts the line HSW8 2019-10-11
in fixed_
, it does the following to set b
: * It extracts, from HSW8 2019-10-11
, the substring starting at position 4 having a length of 2.8
.8
to yield 8
.
Retrieve the data from foo
:
SELECT * from foo ORDER BY a;
+------+------+---------------------+
| a | b | c |
+------+------+---------------------+
| APE | 60 | 2020-06-01 00:00:00 |
| HSW | 8 | 2019-10-11 00:00:00 |
| TR | 25 | 2019-08-07 00:00:00 |
| YTR | 12 | 2020-09-02 00:00:00 |
+------+------+---------------------+
Loading Data using Hex Field Terminator Syntax
Loading data into a table can be performed using a hexadecimal field terminator.
Syntax
CREATE TABLE <table name>(a int, b int);LOAD DATA infile "/tmp/<file name>.csv"INTO TABLE test (a, b) fields terminated by 0x2c;SELECT * FROM <table name>;
**
+------+------+
| a | b |
+------+------+
| 1 | 2 |
+------+------+
Contents of the hex_
was created via the CLI using the following commands:
echo 1. 2 > hex_test.csv
hexdump -C hex_test.csv
00000000 31 2e 20 32 0a |1. 2.|
00000005
JSON LOAD DATA
Syntax
LOAD DATA [LOCAL] INFILE 'file_name'[REPLACE | SKIP { CONSTRAINT | DUPLICATE KEY } ERRORS]INTO TABLE tbl_nameFORMAT JSONsubvalue_mapping[SET col_name = expr,...][WHERE expr,...][MAX_ERRORS number][ERRORS HANDLE string]subvalue_mapping:( {col_name | @variable_name} <- subvalue_path [DEFAULT literal_expr], ...)subvalue_path:{% | [%::]ident [::ident ...]}
Semantics
Error Logging and Error Handling are discussed at the end of this topic.
Extract specified subvalues from each JSON value in file_
.tbl_
, or to variables used for a column assignment in a SET
clause.DEFAULT
clause literal instead.WHERE
clause.
To specify the compression type of an input file, use the COMPRESSION
clause.
The file named by file_
must consist of concatenated UTF-8 encoded JSON values, optionally separated by whitespace.
Non-standard JSON values like NaN
, Infinity
, and -Infinity
must not occur in file_
.
If file_
ends in .
or .
, it will be decompressed.
JSON LOAD DATA
supports a subset of the error recovery options allowed by CSV LOAD DATA
.
Like CSV LOAD DATA
, JSON LOAD DATA
allows you to use globbing to load data from multiple files.
Writing to multiple databases in a transaction is not supported.
Extracting JSON Values
subvalue_
specifies which subvalues are extracted and the column or variable to which each one is assigned.
LOAD DATA
uses the ::
-separated list of keys in a subvalue_
to perform successive key lookups in nested JSON objects, as if applying the ::
SQL operator.::
operator, subvalue_
may not be used to extract an element of a JSON array.%
refers to the entire JSON value being processed.%::
may be omitted from paths which are otherwise non-empty.
If a path can’t be found in an input JSON value, then if the containing element of subvalue_
has a DEFAULT
clause, its literal_
will be assigned; otherwise, LOAD DATA
will terminate with an error.
Path components containing whitespace or punctuation must be surrounded by backticks.%::`a.
and `a.
will both extract 1
from the input object {"a.
.
Array elements may be indirectly extracted by applying JSON_
in a SET
clause.
Converting JSON Values
Before assignment or set clause evaluation, the JSON value extracted according to a subvalue_
is converted to a binary collation SQL string whose value depends on the extracted JSON type as follows:
JSON Type |
Converted Value |
---|---|
|
SQL |
|
|
|
Verbatim, from extracted string. |
|
All JSON string escape sequences, including escape sequences are converted to UTF-8. |
|
Verbatim, from extracted string. |
|
Verbatim, from extracted string. |
Conversion is not recursive.true
is not converted to "1"
when it is a subvalue of an object which is being extracted whole.
JSON LOAD DATA Examples
To use an ENCLOSED BY <char>
as a terminating field, a TERMINATED BY
clause is needed.ENCLOSED BY <char>
appearing within a field value can be duplicated, and they will be understood as a singular occurrence of the character.
If an ENCLOSED BY ""
is used, quotes are treated as follows:
-
"The ""NEW"" employee" → The "NEW" employee
-
The "NEW" employee → The "NEW" employee
-
The ""NEW"" employee → The ""NEW"" employee
Example 1
If example.
consists of:
{"a":{"b":1}, "c":null}{"a":{"b":2}, "d":null}
Then it can be loaded as follows:
CREATE TABLE t(a INT);LOAD DATA LOCAL INFILE "example.json" INTO TABLE t(a <- a::b) FORMAT JSON;SELECT * FROM t;
+------+
| a |
+------+
| 1 |
| 2 |
+------+
Example 2
If example2.
consists of:
{"b":true, "s":"A\u00AE\u0022A", "n":-1.4820790816978637e-25, "a":[1,2], "o":{"subobject":1}}{"b":false}"hello"
Then we can perform a more complicated LOAD DATA
:
CREATE TABLE t(b bool NOT NULL, s TEXT, n DOUBLE, a INT, o JSON NOT NULL, whole longblob);LOAD DATA LOCAL INFILE "example2.json" INTO TABLE t FORMAT JSON(b <- b default true,s <- s default NULL,n <- n default NULL,@avar <- a default NULL,o <- o default '{"subobject":"replaced"}',whole <- %)SET a = json_extract_double(@avar, 1)WHERE b = true;SELECT * FROM t;
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
| b | s | n | a | o | whole |
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
| 1 | A®"A | -1.4820790816978637e-25 | 2 | {"subobject":1} | {"b":true, "s":"A\u00AE\u0022A", "n":-1.4820790816978637e-25, "a":[1,2], "o":{"subobject":1}} |
| 1 | NULL | NULL | NULL | {"subobject":"replaced"} | hello |
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
There are several things to note in the example above:
-
true
was converted to"1"
for columnsb
, but not for columnwhole
."1"
was further converted to theBOOL
value1
. -
The escapes
"\u00AE"
and"\u0022"
were converted to UTF-8 for columns
, but not for columnwhole
.Note that whole
would have become invalid JSON if we had translated"\u0022"
. -
The second row was discarded because it failed to match the
WHERE
clause. -
None of the paths in
subvalue_
could be found in the third row, somapping DEFAULT
literals like'{"subobject":"replaced"}'
were assigned instead. -
We assigned
a
to an intermediate variable so that we could extract an array element in theSET
clause. -
The
top-level
JSON values inexample2.
were not all JSON objects.json "hello"
is a validtop-level
JSON value.
Loading JSON Data from a CSV File
To use an ENCLOSED BY <char>
as a terminating field, a TERMINATED BY
clause is needed.ENCLOSED BY <char>
appearing within a field value can be duplicated, and they will be understood as a singular occurrence of the character.
If an ENCLOSED BY ""
is used, the quotes are treated as follows:
-
"The ""New"" employee" → The "NEW" employee
-
The "New" employee → The "NEW" employee
-
The ""NEW"" employee → The ""NEW"" employee
Example 1
An ENCLOSED BY
clause is required when a csv file has a JSON column enclosed with double quotation marks (" ").
CREATE TABLE employees(emp_id int, data JSON);
csv file contents
emp_id,data
159,"{""name"": ""Damien Karras"", ""age"": 38, ""city"": ""New York""}"
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employeesFIELDS TERMINATED BY ','ENCLOSED BY '"'IGNORE 1 LINES;SELECT * FROM employees;
+--------+-----------------------------------------------------+
| emp_id | data |
+--------+-----------------------------------------------------+
| 159 | {"age":38,"city":"New York","name":"Damien Karras"} |
+--------+-----------------------------------------------------+
Example 2
An ESCAPED BY
clause is required when a character is specified as an escape character for a string.
csv file contents
emp_id,data
298,"{\"name\": \"Bill Denbrough\", \"age\": 25, \"city\": \"Bangor\"}"
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employeesINTO TABLE employeesFIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'IGNORE 1 LINES;SELECT * FROM employees;
+--------+-----------------------------------------------------+
| emp_id | data |
+--------+-----------------------------------------------------+
| 298 | {"age":25,"city":"Bangor","name":"Bill Denbrough"} |
| 159 | {"age":38,"city":"New York","name":"Damien Karras"} |
+--------+-----------------------------------------------------+
Example 3
This example will fail as the JSON field in the csv file is not in the correct format.
csv file contents
emp_id,data
410,"{"name": "Annie Wilkes", "age": 45, "city":"Silver Creek"}"
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employeesFIELDS TERMINATED BY ','ENCLOSED BY '{'IGNORE 1 LINES;
ERROR 1262 (01000): Leaf Error (127.0.0.1:3307): Row 1 was truncated; it contained more data than there were input columns
Example 4
An ENCLOSED BY
clause is required when a csv file has a JSON column enclosed with curly brackets ({ }).
csv file contents
emp_id,data
089,{"name": "Wilbur Whateley","age": 62,"city": "Dunwich"}
LOAD DATA INFILE '/tmp/<file_name>.csv' INTO TABLE employeesFIELDS TERMINATED BY ','ENCLOSED BY '{'IGNORE 1 LINES;SELECT * FROM employees;
+--------+------------------------------------------------------+
| emp_id | data |
+--------+------------------------------------------------------+
| 298 | {"age":25,"city":"Bangor","name":"Bill Denbrough"} |
| 159 | {"age":38,"city":"New York","name":"Damien Karras"} |
| 89 | {"age":62,"city":"Dunwich","name":"Wilbur Whateley"} |
+--------+------------------------------------------------------+
Avro LOAD DATA
Syntax for LOAD DATA Local Infile
LOAD DATA [LOCAL] INFILE 'file_name'WHERE/SET/SKIP ERRORS[REPLACE | SKIP { CONSTRAINT | DUPLICATE KEY } ERRORS]INTO TABLE tbl_nameFORMAT AVRO SCHEMA REGISTRY {"IP" | "Hostname"}subvalue_mapping[SET col_name = expr,...][WHERE expr,...][MAX_ERRORS number][ERRORS HANDLE string][SCHEMA 'avro_schema']subvalue_mapping:( {col_name | @variable_name} <- subvalue_path, ...)subvalue_path:{% | [%::]ident [::ident ...]}
See the associated GitHub repo.
Syntax for LOAD DATA AWS S3 Source
Avro-formatted data stored in an AWS S3 bucket can use a LOAD DATA query without a pipeline.
LOAD DATA S3 '<bucket name>'CONFIG '{"region" : "<region_name>"}'CREDENTIALS '{"aws_access_key_id" : "<key_id> ","aws_secret_access_key": "<access_key>"}'INTO TABLE <table_name>(`<col_a>` <- %,`<col_b>` <- % DEFAULT NULL ,) FORMAT AVRO;
Semantics
Error Logging and Error Handling are discussed at the end of this topic.
LOAD DATA
for Avro does not support file name globbing (for example: LOAD DATA INFILE '/data/nfs/gp1/*.
).LOAD DATA
for Avro only supports loading a single file per statement.
Extract specified subvalues from each Avro value in file_
.tbl_
, or to variables used for a column assignment in a SET
clause.WHERE
clause.
To specify the compression type of an input file, use the COMPRESSION
clause.
Avro LOAD DATA
expects Avro data in one of two sub-formats
, depending on the SCHEMA
clause.
If no SCHEMA
clause is provided, file_
must name an Avro Object Container File as described in version 1.
-
The compression codec of the file must be
null
. -
Array and map values must not have more than 16384 elements.
-
The type name of a
record
must not be used in a symbolicreference to previously defined name
in any of its fields.It may still be used in a symbolic reference outside the record definition, however. For example, self-referential schemas like the following are rejected by
LOAD DATA
:{"type": "record","name": "PseudoLinkedList","fields" : [{"name": "value", "type": "long"},{"name": "next", "type": ["null", "PseudoLinkedList"]}]}
If a SCHEMA
clause is provided, the file must be a raw stream
consisting of only the concatenated binary encodings of instances of avro_
.avro_
must be a SQL string containing a JSON Avro schema.raw stream
files.
Warning
It’s an error to provide a SCHEMA
clause when loading an Object Container File because it contains metadata alongside the encoded values.
All optional Avro schema attributes except the namespace
attribute are ignored.logicalType
attributes are ignored.
If file_
ends in .
or .
, it will be decompressed.
Avro LOAD DATA
supports a subset of the error recovery options allowed by CSV LOAD DATA
.
Writing to multiple databases in a transaction is not supported.
The SCHEMA REGISTRY {"IP" | "Hostname"}
option allows LOAD DATA
to pull the schema from a schema registry.
Extracting Avro Values
subvalue_
specifies which subvalues are extracted and the column or variable to which each one is assigned.
LOAD DATA
uses the ::
-separated list of names in a subvalue_
to perform successive field name or union branch type name lookups in nested Avro records or unions.subvalue_
may not be used to extract elements of Avro arrays or maps.%
refers to the entire Avro value being processed.%::
may be omitted from paths which are otherwise non-empty.
If a path can’t be found in an input Avro value, then: * If a prefix of the path matches a record whose schema has no field matching the next name in the path, then LOAD DATA
will terminate with an error.LOAD DATA
will terminate with an error.null
will be extracted instead and LOAD DATA
will continue.
Path components naming union branches must use the two-part fullname of the branch’s type if that type is in a namespace.
Path components containing whitespace or punctuation must be surrounded by backticks.
Array and map elements may be indirectly extracted by applying JSON_
in a SET
clause.
For example, consider two Avro records with the union schema:
["int",{ "type" : "record","name" : "a","namespace" : "n","fields" : [{ "name" : "f1","type" : "int" }]}]
The paths %::`n.
and `n.
will both extract 1
from an instance of this schema whose JSON encoding is {"n.
.
They will extract null
from an instance whose encoding is {"int":2}
.
The paths %::int
and int
will extract 2
from the second instance and null
from the first.
Converting Avro Values
Before assignment or set clause evaluation, the Avro value extracted according to a subvalue_
is converted to an unspecified SQL type which may be further explicitly or implicitly converted as if from a SQL string whose value is as follows:
Avro Type |
Converted Value |
---|---|
|
SQL |
|
|
|
The string representation of the value |
|
The string representation of the value |
|
SQL |
|
SQL |
|
The string representation of the enum. |
|
Verbatim, from input bytes |
|
Verbatim, from input bytes |
|
Verbatim, from input bytes |
|
|
|
|
|
|
|
logicalType
attributes are ignored and have no effect on conversion.
Avro LOAD DATA Examples
Example 1
Consider an Avro Object Container Fileexample.
with the following schema:
{"type": "record","name": "data","fields": [{ "name": "id", "type": "long"},{ "name": "payload", "type": [ "null","string" ]}]}
example.
contains three Avro values whose JSON encodings are:
{"id":1,"payload":{"string":"first"}}{"id":1,"payload":{"string":"second"}}{"id":1,"payload":null}
example.
can be loaded as follows:
CREATE TABLE t(payload TEXT, input_record JSON);LOAD DATA LOCAL INFILE "example.avro"INTO TABLE tFORMAT AVRO( payload <- %::payload::string,input_record <- % );SELECT * FROM t;
+---------+----------------------------------------+
| payload | input_record |
+---------+----------------------------------------+
| first | {"id":1,"payload":{"string":"first"}} |
| second | {"id":1,"payload":{"string":"second"}} |
| NULL | {"id":1,"payload":null} |
+---------+----------------------------------------+
LOAD DATA
was able to parse example.
because Avro Object Container Files have a header which contains their schema.
Example 2
Consider a file named example.
, with the same values as example.
from Example 1 but in the raw stream format
.example.
consists of the binary encoded values and nothing else.SCHEMA
clause to tell LOAD DATA
to expect a raw stream
with the provided schema:
CREATE TABLE t(payload TEXT, input_record JSON);LOAD DATA LOCAL INFILE "example.raw_avro"INTO TABLE tFORMAT AVRO( payload <- %::payload::string,input_record <- % )schema'{"type": "record","name": "data","fields": [{ "name": "id", "type": "long"},{ "name": "payload", "type": [ "null", "string" ]}]}';SELECT * FROM t;
+---------+----------------------------------------+
| payload | input_record |
+---------+----------------------------------------+
| first | {"id":1,"payload":{"string":"first"}} |
| second | {"id":1,"payload":{"string":"second"}} |
| NULL | {"id":1,"payload":null} |
+---------+----------------------------------------+
Example 3
Consider an Object Container Fileexample3.
with a more complicated payload than Example 1.
{ "type": "record","namespace": "ns","name": "data","fields": [{ "name": "id", "type": "long" },{ "name": "payload", "type":[ "null",{ "type": "record","name": "payload_record","namespace": "ns","fields": [{ "name": "f_bytes", "type": "bytes"},{ "name": "f_string", "type": "string"},{ "name": "f_map", "type":{ "type": "map","values": { "type": "array","items": "int" }}}]}]}]}
The raw JSON encoding of the contents of this file can be seen in column c_
after the following LOAD DATA
:
CREATE TABLE t (c_id bigint,c_bytes longblob,c_string longblob,c_array_second int,c_whole_raw longblob,c_whole_json json);LOAD DATA INFILE "example3.avro"INTO TABLE tFORMAT AVRO( c_id <- %::id,c_bytes <- %::payload::`ns.payload_record`::f_bytes,c_string <- %::payload::`ns.payload_record`::f_string,@v_map <- %::payload::`ns.payload_record`::f_map,c_whole_raw <- %,c_whole_json <- %)SET c_array_second = JSON_EXTRACT_JSON(@v_map, "a", 1);SELECT * FROM t;
*** 1. row ***
c_id: 1
c_bytes: NULL
c_string: NULL
c_array_second: NULL
c_whole_raw: {"id":1,"payload":null}
c_whole_json: {"id":1,"payload":null}
*** 2. row ***
c_id: 2
c_bytes: "A
c_string: "A
c_array_second: 2
c_whole_raw: {"id":2,"payload":{"ns.payload_record":{"f_bytes":"\u0022\u0041","f_string":"\"A","f_map":{"a":[1,2]}}}}
c_whole_json: {"id":2,"payload":{"ns.payload_record":{"f_bytes":"\"A","f_map":{"a":[1,2]},"f_string":"\"A"}}}
There are several things to note:
-
We attempted to extract subvalues of the
payload_
branch of the union-typerecord payload
field.Since that wasn’t the selected member of the union in record 1, LOAD DATA
assignedNULL
toc_
andbytes @v_
.map -
We assigned the JSON encoding of
f_
tomap @v_
and then performed JSON map and array lookups in themap SET
clause to ultimately extract2
. -
f_
andstring f_
had the same contents, but we can see how their different Avro types affected their JSON encodings and interacted with the SQL JSON typebytes -
The JSON encoding of the Avro
string
valuef_
, as seen instring c_
, encodes special characters likewhole_ raw "
as the escape sequence\"
. -
The JSON encoding of the Avro
bytes
valuef_
, as seen inbytes c_
, encodes every byte with a JSON escape.whole_ raw -
When converting the JSON encoding of record 2 to the SQL JSON type while assigning to
c_
,whole_ json LOAD DATA
normalized both representations of the byte sequence"A
to\"A
.
-
Loading Parquet Data
The LOAD DATA
command supports loading Parquet files from AWS S3 or local files.LOAD DATA
clause in a CREATE PIPELINE .
Syntax for LOAD DATA AWS S3 or Local File Source
Parquet-formatted data stored in an AWS S3 bucket or the local filesystem can be loaded via a LOAD DATA query without a pipeline.LOAD DATA
clauses (SET
, WHERE
, etc.
For S3:
LOAD DATA S3 '<bucket name>'CONFIG '{"region" : "<region_name>"}'CREDENTIALS '{"aws_access_key_id" : "<key_id> ","aws_secret_access_key": "<access_key>"}'INTO TABLE <table_name>(`<col_a>` <- %,`<col_b>` <- % DEFAULT NULL ,) FORMAT PARQUET;
For local file:
LOAD DATA INFILE '<path_to_file/file_name>'INTO TABLE <table_name>(val1 <- source1,val2 <- source2[