Extracting or Manipulating Substrings

Manipulating substrings may be necessary for different reasons. The typical ones are:

cleaning data
preparing data for aggregation or grouping

Data cleaning may include removing or replacing characters to repair a string that looks like mistyped, as well as removing invalid data that should not have been there. You may be particularly interested in analyzing only part of a text field, such as the domain name of an email address. Etleap Wrangler offers two main types of transformations for manipulating strings:

based on another string or a regular expression (regex )
based on position

Wrangler offers various transforms that differ in their outcomes. Some of them work with both regex and positional numbers; some allow only one of them.

The Complete List of String Transformations

Extract from a column
Cut a column
Split a column
Edit text in columns

The table below provides a summary of available transforms and their use cases.

Purpose	Operation	Method(s)	Wrangler steps(s)
Data cleaning	Trimming	regex and position	Cut column, Edit text
Data cleaning	Replacement	regex and position	Edit text
Data aggregation	Matching	regex	Extract from a column, Cut a column, Split a column

Keeping the Original Column

Some transforms will keep the original column; some do not. The table below provides an overview:

Wrangler step	Original column is kept?
Extract from a column	Remains unchanged; a new column is created for the results
Cut a column	Is changed; transformations happen directly in the original column
Split a column	Is dropped; a new column is created for the results
Edit text in columns	Is changed; transformations happen directly in the original column

Note

Note

If you want to keep the original column, copy it before changing.

Transform String Columns Using String or Regex

The Max Field

All transforms that use regex have a “Max” field which allows you to control the maximum number of matches to be found. For instance, if you set this parameter to “Once”, and there are two possible matches, the transform will catch only the first one.

Note

Note

Although this is rarely the case, if you have very inconsistent data in the string column, and do not know how many times the substring will show up there, use “Other” or “Repeatedly” and set the maximum split a bit beyond your actual expectations. This may create empty columns, but you can always drop them afterward. It prevents you from losing any data.

Extract Strings

Extract from a column will match the expression with the string values in the column and put the matches in a new column.

27_Extracting a substring based on a regular expression.png — Extracting a substring based on a regular expression

Cut Strings

Cut columns will remove the matching phrases from the column. The records in the original column that have no matches will remain unchanged.

28_Cutting a substring from a string.png — Cutting a substring from a string

Split Strings

Split a column will treat each regex match as a delimiter: a point where a string should be divided. The delimiter won’t be saved anywhere in the dataset after the transformation is performed.

The results can be stored as either columns or rows.

29_Splitting a column based on a regular expression.png — Splitting a column based on regular expression

Splitting a Column Into Multiple Rows

If you choose to save the results as rows, the original record will be duplicated for each match, with only the split column having different values. The values in other columns will be the same for every new row. This is helpful if you’ve got a list of integers or strings in a column and want to unfold it.

Dealing With Quotes Within the Strings

Split a column once allows you to exclude quotes from regex matching. The “quote” character is the one that surrounds a quote. When a phrase is quoted, then separators that appear within the quotes are not treated as separators.

Edit Strings

Edit text can trim strings or modify them by replacing a pattern with another string. It requires a definition of the target replacement in the “To” field. You can leave it empty to shorten strings or specify a substring to edit the original string.

32_Replacing a substring based on a regex.png — Replacing a substring based on a regex

Predefined Expressions

Check the dropdown menu of the “On” field to find out which regular expressions are available in the Wrangler. We prepared a few particularly useful ones, such as “IP address” and “New line”. To create your custom ones, use “Regular Expression” option from the dropdown.

Extract Substrings by Position Number

You can extract a substring by specifying where it should start and end in the original string with position numbers.

Extract Strings

Extract from a column will take the substring between the start and end position and copy it into a new column.

33_Extract a substring by position number.png — Extract a substring by position number

Note

Note

Make sure to use “Once” in the Max section when using position-based extracting to avoid creating empty columns.

Cut Strings

Cut a column once will remove everything except a part of the original string which is not between the start and the end position.

34_Extracting a substring by cutting out a part of the original string using position numbers.png — Extracting a substring by cutting out a part of the original string using position numbers

Split Strings

Split column will break the original string into multiple ones bases on the end position. It can deliver results as either columns or rows but won’t keep the original column. The positional argument in “Split column” must be specified as a number or list of numbers separated with a comma. The first number in the row defines the end position for the first substring; it takes a part of the original string from the beginning and to the end position inclusively. The second number in the row defines the end position of the second piece, but this piece will only start where the first one ended. And so on with the third and fourth parts, etc. When the Wrangler arrives at the end of the row, it places the remaining part of the original string into a separate column.

35_Splitting a column based on positional numbers.png — Splitting a column based on positional numbers