# Tokenizer Reference
Tokenizers are used by some matchers to split a single text into multiple parts, allowing matching on the resulting tokens instead of the whole text.
## No Tokenization
This tokenizer does not split the input text, but instead returns the whole text as the only token.
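As a rough sketch (the class and method names here are hypothetical, not this tool's actual API), a "no tokenization" tokenizer can be pictured as returning the whole text as a single-element token list:

```java
import java.util.List;

// Hypothetical sketch of a "no tokenization" tokenizer:
// the whole input text becomes the one and only token.
public class NoTokenizerSketch {
    static List<String> tokenize(String text) {
        return List.of(text);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Smith, John Jim")); // -> [Smith, John Jim]
    }
}
```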
## Regular Expression Tokenization
This tokenizer extracts tokens using the provided regular expression.
The regular expression must describe a token itself, not the text between tokens. Because of this, you can also define tokens that are not separated by any kind of delimiter (see the second example below).
### Example
- Token Regular Expression: `[^,\s]+`
  (one or more consecutive characters that are neither commas nor whitespace)

  | Input | Tokens |
  |---|---|
  | Smith, John Jim | `Smith`, `John`, `Jim` |
- Token Regular Expression: `.{4}`
  (any four consecutive characters)

  | Input | Tokens |
  |---|---|
  | CCTTACTTATAATGCTCATGCTA | `CCTT`, `ACTT`, `ATAA`, `TGCT`, `CATG` |
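
To make the behaviour concrete, here is a minimal Java sketch (an illustration, not this tool's own implementation) that extracts tokens by matching the token pattern rather than splitting on separators. Applied to the two examples it yields the tokens shown above; note that the trailing `CTA` of the second input is shorter than four characters and therefore does not become a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizerSketch {

    // Collects every non-overlapping match of the token pattern.
    static List<String> tokenize(String tokenRegex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(tokenRegex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // First example: tokens are runs of characters without commas or whitespace.
        System.out.println(tokenize("[^,\\s]+", "Smith, John Jim"));
        // -> [Smith, John, Jim]

        // Second example: fixed-length tokens of four characters each;
        // the trailing "CTA" is too short to match and is not returned.
        System.out.println(tokenize(".{4}", "CCTTACTTATAATGCTCATGCTA"));
        // -> [CCTT, ACTT, ATAA, TGCT, CATG]
    }
}
```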
## Word-Based Tokenization
This tokenizer extracts tokens at word boundaries.
It is comparable to a regular expression tokenizer using the expression `\p{L}+`. That means a token contains only letters; digits and other non-letter characters act as boundaries.

| Input | Tokens |
|---|---|
| 1John2Jim-Smith | `John`, `Jim`, `Smith` |
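
As a rough Java equivalent (again a sketch, not the tool's own code), matching `\p{L}+` reproduces the table above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordTokenizerSketch {

    // Extracts runs of letters; digits, punctuation and whitespace act as boundaries.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\p{L}+").matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1John2Jim-Smith")); // -> [John, Jim, Smith]
    }
}
```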