Skip to Content
DocumentationWrangling DataRemove Invalid Data

Remove Invalid Data

What Is Invalid Data

Invalid Data demonstrates one or more of the following characteristics:

  • it was incorrectly generated
  • it may be incomplete or look unrelated to the rest of the column
  • it has inconsistent format
  • it is incompatible with the transformation you want to perform

Signs of Invalid Data

Sometimes the Wrangler cannot identify the data type. This usually happens when the same column has values in different formats. The Wrangler then displays the most permissive format, which is typically a string.

There are multiple ways to deal with invalid data, each of them deriving from one of the following choices:

  • replacing invalid data with another value
  • removing it completely and excluding the affected records from the dataset

If the column type is not what you expected, click on the type icon in the column header. A small chart will appear. If a column is mostly of one type but not all, it will show how many records in the sample could not be parsed as that type.

The following example illustrates a possible scenario of dealing with invalid data.

As you can see in the screenshot below, most of the values could be recognized as integers but some could not. Thus, the Wrangler could not assign the type to be integer and chose string instead. Based on our domain knowledge of the column “num_actions”, we know this column should be integer and not string. Let’s explore some options for resolving this.

39_Column with invalid data.png

Replacing Invalid Data

The following methods are all based on the assumption that the data in the affected columns is set as a string type - (1) in the screenshot. In our case, we have a column with mostly integers and a few cells having an underscore instead of a number. The underscores caused the column to be classified as a string format.

40_Replacing Invalid Data.png
Replacing Invalid Data

Edit Text in Columns

You can replace the values that do not fit with the desired column data type with new values in a matching format or replace them with null values. This brings consistency and helps the Wrangler recognize the data type.

Which option you choose depends on what you are going to do with your data afterward.

Sometimes you need null, and sometimes you need 0. These are not the same! If you need to perform aggregations later, using null may exclude the respective rows from an aggregation (for instance, a mean or median), but using 0 will keep them. This can influence the final result of your calculation.

Steps

  1. Find the transform “Edit text in columns” in the Wrangler or type in “edit text” and pick up the top result.
  2. Select the column.
  3. In the dropdown menu, select “Set to value” to specify a replacement or “Set to null” to replace with null.
Replacing Invalid Data With a Different Value

If you choose “Set to value”, you will need to specify the new value as well as a replacement condition. The replacement condition can be set using one of the following options:

  1. The dropdown menu in the “On” field and, in our case, a regular expression.
41_Replacing invalid data with a value based on a regular expression.png
  1. The where clause. Consult the where-clause help (the i icon) to find out about the syntax. Spoiler: it is very similar to SQL!

In the example below, we can even use either of the following conditions:

num_actions = '_'

or

num_actions is not an integer

Both of them will lead to the same result of fixing the non-integers.

42_Replacing invalid data with a value based on a where clause.png
Replacing invalid data with a value based on a where clause
Replacing Invalid Data With Null

With “Set to null”, you have to use a where clause to avoid losing valid data. The reason is that the replacement is applied to every single row in the column. This would convert a whole column to null values, not only the ones you consider invalid. To avoid this, you need to specify which values you want to replace. You can do it using the where clause with the same conditions as in the previous case.

43_Replacing invalid values with null.png
Replacing invalid values with null

Cut Columns

With this, you can remove invalid characters from a string, either leaving the affected cells empty or, if there were valid characters in the cell, leaving only the correct ones. Making the values consistent throughout a column allows the Wrangler to detect the proper type for this column.

Steps

  1. Find the transform “Cut column once” in the script editor or type in “cut” and pick up the top result.
  2. Select the column.
  3. Select a suitable option in the “On” field; in our case, a regular expression.
44_Cut text in a column to fix parsing errors.png
Cut text in a column to fix parsing errors

Before we cut the column, it was recognized as a string data type. After the transformation, the Wrangler has set it to integer, and so it will be written into the destination. Etleap defines the destination table type as the one defined in the Wrangler.

45_Correct data type after applying the cut function.png
Correct data type after applying the cut transform
Note

Note

The functions cut column and edit text — will change the original column with invalid data instead of creating a new one.

Removing Invalid Records

Sometimes you cannot use records with missing data for calculations, regardless of whether you replace invalid values with anything else. In such a situation, it may make sense to get rid of them completely by deleting the rows with invalid data.

Delete Rows

You can use the “Delete Rows Where” transform to remove invalid records.

Steps

  1. Find the step “Delete Rows Where” in the Wrangler or type in “delete rows” and pick up the top result.
  2. Select the column.
  3. Compile a suitable where clause. Remember to use the column name with invalid data in the condition.
46_removing the rows with invalid data from the dataset.png
Removing the rows with invalid data from the dataset