
Data Partitioning

In data lakes it’s often useful to partition incoming data based on attributes in the data itself. Here’s how to do that in Etleap:

  1. In the wrangler, click ‘Add Step’, and pick ‘Misc’.
  2. In the ‘Operation’ dropdown pick ‘Partition Output’.
  3. In the ‘Column’ input, enter the column(s) you want to partition by. Multiple columns can be specified as a comma-separated list.
  4. Click ‘Add’ and then ‘Next’.
  5. Pick your S3 Data Lake destination.
  6. On the following page, you will see that the S3 path template contains a placeholder for the partition column, which will be replaced by the partition value when data is loaded. Note that this placeholder comes just after the pipeline version indicator and before the load-time partitioning.
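To make the placeholder substitution concrete, here is a minimal sketch of how a partition placeholder in an S3 path template resolves to concrete object paths. The template syntax, bucket name, and column name (`country`) are illustrative assumptions, not Etleap's exact format:

```python
# Hypothetical S3 path template: the partition placeholder ({country})
# sits after the pipeline version indicator (v{version}) and before the
# load-time partitioning (_loaddate=...), as described above.
template = "s3://my-bucket/lake/v{version}/{country}/_loaddate={loaddate}/part-0000.parquet"

def resolve(template: str, **values) -> str:
    """Substitute concrete values for the placeholders in a path template."""
    return template.format(**values)

path = resolve(template, version=1, country="US", loaddate="20240115")
print(path)
# s3://my-bucket/lake/v1/US/_loaddate=20240115/part-0000.parquet
```

Rows with different values in the partition column land under different S3 prefixes, so downstream query engines can prune partitions by path.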

By default, Etleap adds up to two extra partitions:

  1. _loaddate, which is the date the data was loaded, in ISO 8601 basic format

  2. _is_deleted, which is enabled only for Parquet output, and is true when a row's primary key has been deleted from the source, for sources where Etleap supports extracting deletes (e.g. Salesforce or database CDC)
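For reference, ISO 8601 basic format omits the separators of the extended format (YYYYMMDD rather than YYYY-MM-DD). The sketch below shows how a _loaddate partition value of that shape can be derived; the exact timestamp Etleap uses for load time is an assumption here:

```python
from datetime import datetime, timezone

def load_date(ts: datetime) -> str:
    """Format a load timestamp as an ISO 8601 basic-format date (YYYYMMDD)."""
    return ts.strftime("%Y%m%d")

print(load_date(datetime(2024, 1, 15, tzinfo=timezone.utc)))  # 20240115
```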