Loading Deletes
Original Lake Format
Deletes files contain the Primary Key combinations of the rows that a given activity bundle marks for deletion. These files are copied directly to the base directory for a loaded transformation. The deletes file name format depends on the Output Format configured for the Pipeline.
| Output Format | Single Deletes File | Multiple Deletes Files |
|---|---|---|
| CSV | _deletes.csv.gz | _deletes.csv.gz, _deletes2.csv.gz, _deletes3.csv.gz, … |
| Parquet | deletes.snappy.parquet | deletes.snappy.parquet, deletes2.snappy.parquet, deletes3.snappy.parquet, … |
The deletes files will be copied to the load base directory:
s3://bucket/base/dir/{pipeline_path}/vX/{loadtime}/_deletesX.csv.gz
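The naming scheme in the table above can be sketched in a few lines of Python. This is only an illustration: the helper names `deletes_file_names` and `deletes_file_path` are hypothetical and not part of the product. The sketch assumes that the first file carries no numeric suffix and later files are numbered from 2, as the table suggests, and it treats the version and load time as opaque strings.

```python
from typing import List

# File name stem and extension per Output Format, copied from the table above.
_PATTERNS = {
    "CSV": ("_deletes", ".csv.gz"),
    "Parquet": ("deletes", ".snappy.parquet"),
}


def deletes_file_names(output_format: str, count: int) -> List[str]:
    """Return the expected deletes file names when `count` files are written."""
    stem, ext = _PATTERNS[output_format]
    # Assumption: the first file has no number; later files are numbered 2, 3, ...
    return [f"{stem}{'' if i == 1 else i}{ext}" for i in range(1, count + 1)]


def deletes_file_path(base: str, pipeline_path: str, version: str,
                      loadtime: str, file_name: str) -> str:
    """Join the load base directory components with a deletes file name."""
    return f"{base.rstrip('/')}/{pipeline_path}/{version}/{loadtime}/{file_name}"
```

For example, `deletes_file_names("CSV", 3)` yields the three CSV names listed in the table, and each name can then be passed to `deletes_file_path` to obtain its full destination.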
New Lake Format
If the Output Format of the pipeline does not support deletes partitioning, then the deletes scheme described in Original Lake Format will be used. Otherwise, a deletes partition will be created and deletes files will be copied to the partition directory, where the partition value is “true” for deletes files and “false” for data files.
| Output Format | Supports Deletes Partitioning? |
|---|---|
| CSV | YES |
| Parquet | NO |
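The selection rule described above can be sketched as follows. The function name `deletes_partition_value` is hypothetical, and whether a format supports deletes partitioning is passed in as a flag rather than hard-coded, so the sketch makes no claim of its own about any particular format.

```python
from typing import Optional


def deletes_partition_value(supports_partitioning: bool,
                            is_deletes_file: bool) -> Optional[str]:
    """Return the deletes partition value for a file, or None if unpartitioned.

    None means no deletes partition is created and the Original Lake Format
    deletes scheme applies instead.
    """
    if not supports_partitioning:
        return None
    # The partition value is "true" for deletes files and "false" for data files.
    return "true" if is_deletes_file else "false"
```

A caller would use the returned value as the partition directory name, and fall back to the Original Lake Format naming when the result is `None`.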
The deletes files will be copied to the following directory:
s3://bucket/base/dir/{pipeline_path}/vX/{loadtime}/true/{transformation_uuid}-{original_filename}.snappy.parquet
While data files will be copied to a directory like:
s3://bucket/base/dir/{pipeline_path}/vX/{explicit_partitions}/{loadtime}/false/{transformation_uuid}-{original_filename}.snappy.parquet
Interaction with Glue Catalogs
The deletes files will be copied to the S3 destination as described above, regardless of whether a Glue Catalog is being created for the load. However, because CSV and Glue are not completely compatible, pipelines can no longer be created with the CSV Output Format if the destination is an S3 Data Lake with a Glue database defined on the connection.