Extracting or Manipulating Substrings
Manipulating substrings may be necessary for different reasons. The typical ones are:
- cleaning data
- preparing data for aggregation or grouping
Data cleaning may include removing or replacing characters to repair a string that looks like mistyped, as well as removing invalid data that should not have been there. You may be particularly interested in analyzing only part of a text field, such as the domain name of an email address. Etleap Wrangler offers two main types of transformations for manipulating strings:
- based on another string or a regular expression (regex )
- based on position
Wrangler offers various transforms that differ in their outcomes. Some of them work with both regex and positional numbers; some allow only one of them.
The Complete List of String Transformations
- Extract from a column
- Cut a column
- Split a column
- Edit text in columns
The table below provides a summary of available transforms and their use cases.
Purpose | Operation | Method(s) | Wrangler steps(s) |
---|---|---|---|
Data cleaning | Trimming | regex and position | Cut column, Edit text |
Data cleaning | Replacement | regex and position | Edit text |
Data aggregation | Matching | regex | Extract from a column, Cut a column, Split a column |
Keeping the Original Column
Some transforms will keep the original column; some do not. The table below provides an overview:
Wrangler step | Original column is kept? |
---|---|
Extract from a column | Remains unchanged; a new column is created for the results |
Cut a column | Is changed; transformations happen directly in the original column |
Split a column | Is dropped; a new column is created for the results |
Edit text in columns | Is changed; transformations happen directly in the original column |
Transform String Columns Using String or Regex
The Max Field
All transforms that use regex have a “Max” field which allows you to control the maximum number of matches to be found. For instance, if you set this parameter to “Once”, and there are two possible matches, the transform will catch only the first one.
Note
Although this is rarely the case, if you have very inconsistent data in the string column, and do not know how many times the substring will show up there, use “Other” or “Repeatedly” and set the maximum split a bit beyond your actual expectations. This may create empty columns, but you can always drop them afterward. It prevents you from losing any data.
Extract Strings
Extract from a column will match the expression with the string values in the column and put the matches in a new column.

Cut Strings
Cut columns will remove the matching phrases from the column. The records in the original column that have no matches will remain unchanged.

Split Strings
Split a column will treat each regex match as a delimiter: a point where a string should be divided. The delimiter won’t be saved anywhere in the dataset after the transformation is performed.
The results can be stored as either columns or rows.

Splitting a Column Into Multiple Rows
If you choose to save the results as rows, the original record will be duplicated for each match, with only the split column having different values. The values in other columns will be the same for every new row. This is helpful if you’ve got a list of integers or strings in a column and want to unfold it.

Dealing With Quotes Within the Strings
Split a column once allows you to exclude quotes from regex matching. The “quote” character is the one that surrounds a quote. When a phrase is quoted, then separators that appear within the quotes are not treated as separators.

Edit Strings
Edit text can trim strings or modify them by replacing a pattern with another string. It requires a definition of the target replacement in the “To” field. You can leave it empty to shorten strings or specify a substring to edit the original string.

Predefined Expressions
Check the dropdown menu of the “On” field to find out which regular expressions are available in the Wrangler. We prepared a few particularly useful ones, such as “IP address” and “New line”. To create your custom ones, use “Regular Expression” option from the dropdown.
Extract Substrings by Position Number
You can extract a substring by specifying where it should start and end in the original string with position numbers.
Extract Strings
Extract from a column will take the substring between the start and end position and copy it into a new column.

Cut Strings
Cut a column once will remove everything except a part of the original string which is not between the start and the end position.

Split Strings
Split column will break the original string into multiple ones bases on the end position. It can deliver results as either columns or rows but won’t keep the original column. The positional argument in “Split column” must be specified as a number or list of numbers separated with a comma. The first number in the row defines the end position for the first substring; it takes a part of the original string from the beginning and to the end position inclusively. The second number in the row defines the end position of the second piece, but this piece will only start where the first one ended. And so on with the third and fourth parts, etc. When the Wrangler arrives at the end of the row, it places the remaining part of the original string into a separate column.
