File Types
Etleap supports ingesting a wide selection of file types. This page lists the most commonly used file types and formats that Etleap supports. The Wrangler infers the file type and format from a sample of the input files.
Avro
Etleap supports reading any Avro file, regardless of the software that produced it, as long as the Avro schema in the file complies with the Apache Avro specification.
The Wrangler automatically detects this file format and uses the Parse as Avro transform.
Compressed Files
Compressed file formats such as .zip and .gz will be read and interpreted based on the data format inside the file.
Etleap currently supports only archives that contain a single compressed file, not archives that contain multiple files. If a compressed archive contains multiple files, Etleap arbitrarily selects a single file, which may lead to incomplete data in the destination.
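Because only one file per archive is read, it can help to check an archive before uploading it. The following is a minimal sketch using Python's standard zipfile module; the helper names count_archive_members and build_zip, and the sample file names, are hypothetical and not part of Etleap.

```python
import io
import zipfile


def count_archive_members(zip_bytes: bytes) -> int:
    """Return the number of regular files (not directories) in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


def build_zip(files: dict) -> bytes:
    """Build an in-memory zip archive from a name -> text mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


# Hypothetical sample archives: one with a single file, one with two files.
single = build_zip({"orders.csv": "id,total\n1,9.50\n"})
multi = build_zip({
    "orders.csv": "id,total\n1,9.50\n",
    "refunds.csv": "id,total\n2,3.00\n",
})

print(count_archive_members(single))  # 1
print(count_archive_members(multi))   # 2
```

An archive that reports more than one member would need to be split into one archive per file to avoid incomplete data in the destination.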
CSV and TSV
A file with the extension .csv or .tsv whose data is split into rows by newlines and into columns by commas or tabs.
The first transformation step will be Parse CSV.
Excel
Files with the extensions .xls or .xlsx are recognized as Excel files in the file picker. You can choose the sheets from which you want to extract data. Unless additional sheets are selected, Etleap will extract data from only the first sheet. Additionally, if a glob or regex pattern is used and matches an Excel file, only the first sheet is selected.
The first transformation step will be Parse as Excel.
JSON Arrays
A text file containing a JSON array of JSON objects (e.g. [{"foo":"bar"}, {"foo":"foo"}]).
This is particularly useful when input files contain multi-line JSON objects.
The first transformation step will be Parse as JSON array.
For more details see: JSON Parsing
JSON Data
Any file containing lines of JSON objects delimited by newlines. Both flat and nested JSON objects are supported.
The first transformation steps will be Split data repeatedly on newline and Flatten nested JSON object in data.
For more details see: JSON Parsing
For this file format, each line must contain exactly one JSON object; multi-line JSON objects are not supported.
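To illustrate the difference between the two JSON formats, the sketch below takes a pretty-printed JSON array (which spans multiple lines) and rewrites it as one JSON object per line using Python's standard json module. The helper name to_json_lines and the sample data are hypothetical; this is an illustration of the two layouts, not a description of how Etleap parses them.

```python
import json


def to_json_lines(records):
    """Serialize each record as one compact JSON object per line."""
    return "\n".join(
        json.dumps(record, separators=(",", ":")) for record in records
    )


# A JSON array whose objects span several lines (the "JSON Arrays" layout).
pretty = """[
  {
    "foo": "bar",
    "nested": {"a": 1}
  },
  {
    "foo": "foo",
    "nested": {"a": 2}
  }
]"""

records = json.loads(pretty)

# The same data with one object per line (the "JSON Data" layout).
ndjson = to_json_lines(records)
print(ndjson)
```

Either layout carries the same records; the choice only affects which parsing step applies first.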
Other Plain-Text Formats
Any other plain-text file, such as .txt. The Wrangler will select the most suitable default transformation step.
Parquet
Etleap supports reading any Parquet file, regardless of the software that produced it, as long as the Parquet schema in the file complies with the Apache Parquet specification. This includes files that use compression formats such as Snappy.
The Wrangler automatically detects this file format and uses the Parse as Parquet transform.
Parquet Type Mapping
The table below describes how Parquet types are mapped to Etleap types.
Parquet Logical Type | Parquet Physical Type | Etleap Type | Notes |
---|---|---|---|
null | boolean | BOOLEAN | |
null | int32 | BIGINT | |
null | int64 | BIGINT | |
null | int96 | STRING | Base64 encoding of the binary value |
null | float | DOUBLE | |
null | double | DOUBLE | |
null | fixed_len_byte_array | STRING | Base64 encoding of the binary value |
null | binary | STRING | Base64 encoding of the binary value |
STRING | binary | STRING | Interpreted as UTF-8 |
DATE | int32 | DATE | Date string in the form “YYYY-MM-DD” |
TIMESTAMP: precision MILLIS | int64 | DATETIME | Date string in the form “YYYY-MM-DDTHH:mm:ss.SZ” |
TIMESTAMP: precision MICROS | int64 | DATETIME | Date string in the form “YYYY-MM-DDTHH:mm:ss.SZ” |
TIMESTAMP: precision NANOS | int64 | DATETIME | Date string in the form “YYYY-MM-DDTHH:mm:ss.SZ” |
TIME: precision MILLIS | int64 | STRING | Time string in the format “HH:mm:ss.sss” |
TIME: precision MICROS | int64 | STRING | Time string in the format “HH:mm:ss.sss” |
TIME: precision NANOS | int64 | STRING | Time string in the format “HH:mm:ss.sss” |
INTERVAL | fixed_len_byte_array(12) | STRING | String representation of the interval in the format “HH:mm:ss.sss” |
INT: signed, precision 8, 16, 32, 64 | int8, int16, int32, int64 | BIGINT | |
INT: unsigned, precision 8, 16, 32 | int8, int16, int32 | BIGINT | |
INT: unsigned, precision 64 | int64 | DECIMAL(20,0) | An unsigned 64-bit integer is too wide to fit into a standard BIGINT type |
DECIMAL: any precision | fixed_len_byte_array | DECIMAL(precision, scale) | |
DECIMAL: precision <= 9 | int32 | DECIMAL(precision, scale) | |
DECIMAL: 9 < precision <= 18 | int64 | DECIMAL(precision, scale) | |
DECIMAL: any precision | fixed_len_byte_array, binary | DECIMAL(precision, scale) | |
ENUM | binary | STRING | Interpreted as a UTF-8 String |
UUID | fixed_len_byte_array(16) | STRING | String formatting of the UUID: 00112233-4455-6677-8899-aabbccddeeff |
JSON | binary | STRING | String containing the JSON object |
BSON | binary | STRING | String representation of the BSON object: {key=>value, ....} |
LIST | - | STRING | JSON array containing the objects or primitive values |
MAP | - | STRING | JSON object containing the key-value pairs |
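Two rows of the table above can be checked with simple arithmetic. The sketch below shows why an unsigned 64-bit integer needs DECIMAL(20,0) rather than a signed 64-bit BIGINT, and what a Base64-encoded binary value looks like. It assumes the standard Base64 alphabet, which the table does not specify; the sample bytes are hypothetical.

```python
import base64

BIGINT_MAX = 2**63 - 1  # largest signed 64-bit value
UINT64_MAX = 2**64 - 1  # largest unsigned 64-bit value

# The unsigned maximum overflows a signed 64-bit integer...
print(UINT64_MAX > BIGINT_MAX)  # True
# ...and has 20 decimal digits, hence DECIMAL(20,0).
print(len(str(UINT64_MAX)))     # 20

# A raw binary value rendered as a Base64 string
# (assumption: standard alphabet with padding).
raw = bytes([0x00, 0x11, 0x22, 0xFF])
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)

# Decoding recovers the original bytes.
decoded = base64.b64decode(encoded)
```

Downstream consumers that need the original bytes from such a STRING column can decode them the same way.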
XML
Files with the file extension .xml that follow the XML specification.
The first transformation step will be Parse XML.