Loading Deletes

Original Lake Format

Deletes files contain the Primary Key combinations of the rows that a given activity bundle marks for deletion. These files are copied directly to the base directory of a loaded transformation. The deletes file name format depends on the Output Format configured for the pipeline.
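As an illustration of how a consumer might use these Primary Key combinations, here is a minimal sketch that drops matching rows from an in-memory data set. The function name `apply_deletes`, the dictionary-based row representation, and the column names in the usage example are assumptions made for illustration; they are not part of the product.

```python
from typing import Iterable


def apply_deletes(
    rows: Iterable[dict],
    delete_keys: Iterable[dict],
    primary_key: list[str],
) -> list[dict]:
    """Return the rows whose primary key combination is not listed for deletion."""
    # Build a set of key tuples so each lookup is O(1).
    doomed = {tuple(d[col] for col in primary_key) for d in delete_keys}
    return [r for r in rows if tuple(r[col] for col in primary_key) not in doomed]


# Hypothetical data: a composite primary key of (id, region).
rows = [
    {"id": "1", "region": "eu", "value": "a"},
    {"id": "2", "region": "eu", "value": "b"},
    {"id": "1", "region": "us", "value": "c"},
]
deletes = [{"id": "1", "region": "eu"}]
remaining = apply_deletes(rows, deletes, ["id", "region"])
```

Only the row matching every Primary Key column is removed; a row that shares just one column value with a deletes entry is kept.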

Output Format	Single Deletes File	Multiple Deletes Files
CSV	_deletes.csv.gz	_deletes.csv.gz, _deletes2.csv.gz, _deletes3.csv.gz, …
Parquet	deletes.snappy.parquet	deletes.snappy.parquet, deletes2.snappy.parquet, deletes3.snappy.parquet, …
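
The naming pattern in the table can be sketched as a small helper. This is an illustration only: `deletes_file_names` is a hypothetical function, and it assumes the numbering continues exactly as the table suggests (no suffix on the first file, then 2, 3, and so on).

```python
def deletes_file_names(output_format: str, count: int) -> list[str]:
    """List the expected deletes file names for a given Output Format."""
    # Stem and extension per format, copied from the table above.
    patterns = {
        "csv": ("_deletes", ".csv.gz"),
        "parquet": ("deletes", ".snappy.parquet"),
    }
    stem, extension = patterns[output_format.lower()]
    # The first file carries no number; later files are numbered from 2.
    return [
        f"{stem}{'' if index == 1 else index}{extension}"
        for index in range(1, count + 1)
    ]
```

For example, `deletes_file_names("CSV", 3)` yields the three CSV names shown in the table.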

The deletes files will be copied to the load base directory:

s3://bucket/base/dir/{pipeline_path}/vX/{loadtime}/_deletesX.csv.gz
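
A consumer reading a load directory might separate deletes files from data files by file name. The sketch below assumes the names follow the table in this section exactly; `is_deletes_key` is a hypothetical helper, and the example key in the usage note uses made-up path values.

```python
import re

# Matches _deletes.csv.gz, _deletes2.csv.gz, deletes.snappy.parquet, deletes3.snappy.parquet, ...
DELETES_NAME = re.compile(r"^(_deletes\d*\.csv\.gz|deletes\d*\.snappy\.parquet)$")


def is_deletes_key(key: str) -> bool:
    """Return True if the last path segment of an S3 key is a deletes file name."""
    file_name = key.rsplit("/", 1)[-1]
    return DELETES_NAME.match(file_name) is not None
```

With this helper, a key such as `base/dir/my_pipeline/v1/1600000000/_deletes2.csv.gz` is classified as a deletes file, while any other file in the same directory is treated as data.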

New Lake Format

If the Output Format of the pipeline does not support deletes partitioning, then the deletes scheme described in Original Lake Format will be used. Otherwise, a deletes partition will be created and deletes files will be copied to the partition directory, where the partition value is “true” for deletes files and “false” for data files.

Output Format	Supports Deletes Partitioning?
CSV	NO
Parquet	YES

The deletes files will be copied to the following directory:


s3://bucket/base/dir/{pipeline_path}/vX/{loadtime}/true/{transformation_uuid}-{original_filename}.snappy.parquet

Data files will be copied to a directory like:


s3://bucket/base/dir/{pipeline_path}/vX/{explicit_partitions}/{loadtime}/false/{transformation_uuid}-{original_filename}.snappy.parquet
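
Because the deletes partition value is the directory immediately above the file, a reader can tell deletes files from data files by inspecting that path segment. The following is a minimal sketch under that assumption; `classify_key` is a hypothetical helper, and the keys in the usage note use made-up path values.

```python
def classify_key(key: str) -> str:
    """Classify an S3 key as 'deletes' or 'data' from its deletes partition value."""
    segments = key.split("/")
    if len(segments) < 2:
        raise ValueError(f"key has no partition directory: {key!r}")
    # The partition value is the directory that directly contains the file.
    partition_value = segments[-2]
    if partition_value == "true":
        return "deletes"
    if partition_value == "false":
        return "data"
    raise ValueError(f"unexpected deletes partition value {partition_value!r} in {key!r}")
```

For instance, `classify_key("base/dir/my_pipeline/v1/1600000000/true/abc-part0.snappy.parquet")` returns `"deletes"`, and the same key with `false` in place of `true` returns `"data"`.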

Interaction with Glue Catalogs

The deletes files will be copied to the S3 destination as described above regardless of whether a Glue Catalog is being created for the load. However, because CSV and Glue are not completely compatible, we now prevent pipelines from being created with a CSV Output Format if the destination is an S3 Data Lake with a Glue database defined on the connection.
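
The restriction can be pictured as a check at pipeline creation time. This is only a sketch of the rule as stated here, not the actual implementation; the function name `validate_output_format` and its parameters are hypothetical.

```python
from typing import Optional


def validate_output_format(output_format: str, glue_database: Optional[str]) -> None:
    """Reject CSV output when the destination connection defines a Glue database."""
    # An empty or missing database name means no Glue database is defined.
    if output_format.lower() == "csv" and glue_database:
        raise ValueError(
            "CSV output is not allowed for an S3 Data Lake destination "
            f"with a Glue database ({glue_database!r}) defined on the connection"
        )
```

Under this sketch, Parquet pipelines pass regardless of the Glue setting, and CSV pipelines pass only when no Glue database is defined.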