Loading Deletes
Original Lake Format
Deletes files contain the Primary Key combinations of the rows that a given activity bundle marks for deletion. These files are copied directly to the base directory for a loaded transformation. The deletes file name format depends on the Output Format configured for the pipeline.
Output Format | Single Deletes File | Multiple Deletes Files |
---|---|---|
CSV | _deletes.csv.gz | _deletes.csv.gz, _deletes2.csv.gz, _deletes3.csv.gz, … |
Parquet | deletes.snappy.parquet | deletes.snappy.parquet, deletes2.snappy.parquet, deletes3.snappy.parquet, … |
The deletes files will be copied to the load base directory:
s3://bucket/base/dir/{pipeline_path}/vX/{loadtime}/_deletesX.csv.gz
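The naming pattern in the table above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the product's implementation: the function name `deletes_file_names` and its parameters are hypothetical, and the sketch assumes that numbering continues past 3 in the same way the table suggests.

```python
# Hypothetical helper: reproduces the deletes file names listed in the table.
# Assumption: the first file carries no number and later files are numbered 2, 3, ...
NAME_PARTS = {
    "CSV": ("_deletes", ".csv.gz"),
    "Parquet": ("deletes", ".snappy.parquet"),
}


def deletes_file_names(output_format: str, count: int) -> list[str]:
    """Return the expected deletes file names for a load with `count` deletes files."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if output_format not in NAME_PARTS:
        raise ValueError(f"unknown output format: {output_format!r}")
    stem, suffix = NAME_PARTS[output_format]
    # The first file has no numeric suffix; the i-th file (i >= 2) is numbered i.
    return [f"{stem}{'' if i == 1 else i}{suffix}" for i in range(1, count + 1)]


if __name__ == "__main__":
    for name in deletes_file_names("CSV", 3):
        print(name)
```

A consumer that needs to find deletes files in a load directory could compare the object names there against this list.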
New Lake Format
If the Output Format of the pipeline does not support deletes partitioning, then the deletes scheme described in Original Lake Format will be used. Otherwise, a deletes partition will be created and deletes files will be copied to the partition directory, where the partition value is “true” for deletes files and “false” for data files.
Output Format | Supports Deletes Partitioning? |
---|---|
CSV | NO |
Parquet | YES |
The deletes files will be copied to the following directory:
s3://bucket/base/dir/{pipeline_path}/vX/{loadtime}/true/{transformation_uuid}-{original_filename}.snappy.parquet
Data files will be copied to a directory such as:
s3://bucket/base/dir/{pipeline_path}/vX/{explicit_partitions}/{loadtime}/false/{transformation_uuid}-{original_filename}.snappy.parquet
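As a rough illustration of how a reader of the lake might tell the two kinds of files apart, the sketch below inspects the partition directory that directly contains each file. The function name `is_deletes_file` is hypothetical, and the sketch assumes the "true"/"false" partition value is always the last directory before the file name, as in the two example paths above.

```python
# Hypothetical helper: classifies an object path by its deletes partition value.
# Assumption: the directory immediately containing the file is "true" or "false".
def is_deletes_file(path: str) -> bool:
    """Return True for a deletes file, False for a data file."""
    parts = path.rstrip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"path has no parent directory: {path!r}")
    partition_value = parts[-2]
    if partition_value == "true":
        return True
    if partition_value == "false":
        return False
    raise ValueError(f"no deletes partition found in: {path!r}")


if __name__ == "__main__":
    # Placeholder values stand in for {pipeline_path}, {loadtime}, and the file name.
    example = "s3://bucket/base/dir/my_pipeline/v1/1700000000/true/abc-part0.snappy.parquet"
    print(is_deletes_file(example))
```

Paths written under the Original Lake Format have no such partition directory, so the sketch raises an error for them instead of guessing.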
Interaction with Glue Catalogs
The deletes files will be copied to the S3 destination as described above, regardless of whether a Glue Catalog is being created for the load. However, because CSV and Glue are not fully compatible, pipelines can no longer be created with a CSV Output Format if the destination is an S3 Data Lake with a Glue database defined on the connection.