Process text column

Author

Cox Lab

Published

June 27, 2024

===== General =====

===== Brief description =====

Values in string columns can be manipulated according to a regular expression.

1 Parameters

1.1 Columns

Selected text columns, whose values should be manipulated by the defined regular expression (default: no columns are selected).

1.2 Regular expression

Specified regular expression that is applied to the selected text columns (default: “^([^;]+)”).

A regular expression is a sequence of characters that forms a search pattern with a special syntax. A good general introduction can be found on Wikipedia. If you already know how regular expressions work, a quick reference may be all you need.

Here are a few examples:

Regular expression                      Effect
^([^;]+)                                Select all characters from the beginning of the line, up to but not including the first semicolon. This is the default.
TAG\s*=\s*([^,; ]*)                     Look for the first instance of “TAG =”, with any amount of whitespace (or none) around the equal sign, and return what follows the whitespace until a comma, semicolon, or space is reached.
([ACTG]+)                               Return the first string consisting only of the letters A, C, T, and G.
(20[01][0-9]-[01][0-9]-[0-3][0-9])      Select a date from the years 2000 to 2019, written in the form 2014-08-19.
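The patterns above can be tried outside the software before applying them to a matrix. The following is a minimal sketch in Python, assuming that the standard `re` module behaves like the regular expression engine used by the activity for these simple patterns, which this page does not state. The helper `first_group` and the sample strings are hypothetical, and the TAG pattern is written with `\s*` so that it tolerates whitespace around the equal sign as described.

```python
import re

def first_group(pattern, text):
    """Return the first capture group of the first match, or None if nothing matches.

    Hypothetical helper for illustration; not part of the software.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Sample strings are made up for this sketch.
print(first_group(r"^([^;]+)", "P12345;Q67890"))
print(first_group(r"TAG\s*=\s*([^,; ]*)", "id=7, TAG = kinase; other"))
print(first_group(r"([ACTG]+)", "seq=ACGTTGCA;len=8"))
print(first_group(r"(20[01][0-9]-[01][0-9]-[0-3][0-9])", "run_2014-08-19_batch3"))
```

Each call returns only the part of the string matched inside the parentheses, which is the value the examples in the table describe.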

1.3 Replacement string

You can provide a replacement string here for more flexibility. Leave empty if unsure.

Examples:

Replacement string      Effect
$1                      Replace the original string with the first capture group, i.e. the part of the original string matched inside the first pair of parentheses “(...)”.
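To illustrate how a replacement string such as `$1` refers to capture groups, here is a minimal sketch in Python. It makes several assumptions: Python's `re` module writes group references as `\g<1>` rather than `$1`, so the hypothetical helper `expand_replacement` translates between the two; the result replaces the whole original string, as the table describes; and a string without a match yields an empty result, which this page does not specify.

```python
import re

def expand_replacement(pattern, replacement, text):
    """Build the result string from the first match and a `$n`-style replacement.

    Hypothetical helper for illustration; not part of the software.
    """
    match = re.search(pattern, text)
    if match is None:
        return ""  # assumption: behavior for non-matching strings is not documented
    # Protect literal backslashes, then turn $1, $2, ... into \g<1>, \g<2>, ...
    escaped = replacement.replace("\\", "\\\\")
    template = re.sub(r"\$(\d+)", r"\\g<\1>", escaped)
    return match.expand(template)

print(expand_replacement(r"^([^;]+)", "$1", "P12345;Q67890"))
print(expand_replacement(r"(\w+)-(\w+)", "$2_$1", "ab-cd"))
```

The second call shows why a replacement string adds flexibility: groups can be reordered or combined with literal text instead of returning the first group alone.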

1.4 Keep original columns

If checked, the original columns are retained unchanged, and new columns are appended to hold the results (default: unchecked). The name of a new column is created by appending underscores to the name of the original column until it is unique. If this box is not checked, then the strings in the original columns are overwritten by the results.
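The naming rule for new columns can be sketched in a few lines of Python. The function name `unique_column_name` and the example column names are hypothetical. Because the original column is retained, its own name is already taken, so at least one underscore is always appended.

```python
def unique_column_name(name, existing_names):
    """Append underscores to `name` until it does not clash with `existing_names`.

    Hypothetical helper for illustration; not part of the software.
    """
    candidate = name
    while candidate in existing_names:
        candidate += "_"
    return candidate

columns = {"Protein IDs", "Gene names"}
first = unique_column_name("Protein IDs", columns)
columns.add(first)
second = unique_column_name("Protein IDs", columns)
print(first, second)
```

Running the activity twice on the same column with this option checked would, under this sketch, produce names ending in one and then two underscores.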

1.5 Strings separated by semicolons are independent

If checked, each string is split into substrings at the semicolons, and the regular expression is applied independently to each substring (default: unchecked). The results are joined with semicolons into a single string, which is returned. This is useful for columns where any row may contain multiple entries. If not checked, the string is evaluated as a whole and only the first match is returned.
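The difference between the two modes can be sketched in Python as follows. The function `process_value`, its `independent` flag, and the sample string are hypothetical. The sketch assumes that the pattern contains one capture group and that a substring without a match contributes an empty entry, which this page does not specify.

```python
import re

def process_value(pattern, value, independent=False):
    """Apply `pattern` to the whole value, or to each semicolon-separated part.

    Hypothetical helper for illustration; not part of the software.
    """
    parts = value.split(";") if independent else [value]
    results = []
    for part in parts:
        match = re.search(pattern, part)
        # Assumption: a part without a match yields an empty entry.
        results.append(match.group(1) if match else "")
    return ";".join(results)

value = "xxACGT;yyTTGA"
print(process_value(r"([ACTG]+)", value, independent=False))
print(process_value(r"([ACTG]+)", value, independent=True))
```

With the option off, only the first match in the whole string is kept; with it on, every entry in the row gets its own result.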

2 Parameter window

[Screenshot: parameter window of “Process text column”]