# Tokenizer Reference

Tokenizers are used by some matchers to split a single text into multiple parts, allowing matching on the resulting tokens instead of the whole text.

# No Tokenization

This tokenizer does not split the input text, but instead returns the whole text as the only token.
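
For illustration only, a minimal sketch of this behaviour in Java; the class and method names here are ours and not part of any actual API:

```java
import java.util.List;

public final class NoTokenization {

    // Returns the whole input text unchanged as the single token.
    public static List<String> tokenize(String text) {
        return List.of(text);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Smith, John Jim")); // [Smith, John Jim]
    }
}
```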

# Regular Expression Tokenization

This tokenizer extracts the tokens using the provided regular expression.

The regular expression must describe a token itself, not the text between tokens. This makes it easy to define tokens even when they are not separated by any delimiter at all.

# Examples

- Token Regular Expression: `[^,\s]+` (meaning: consecutive characters other than commas and whitespace)

| Input           | Tokens                 |
|-----------------|------------------------|
| Smith, John Jim | `Smith`, `John`, `Jim` |
- Token Regular Expression: `.{4}` (meaning: four consecutive characters)

| Input                   | Tokens                                 |
|-------------------------|----------------------------------------|
| CCTTACTTATAATGCTCATGCTA | `CCTT`, `ACTT`, `ATAA`, `TGCT`, `CATG` |
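
The extraction logic behind these examples can be sketched with Java's `java.util.regex`; the class below is purely illustrative and not the tokenizer's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class RegexTokenization {

    // Collects every match of the token expression; text between matches is skipped.
    public static List<String> tokenize(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(tokenRegex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Smith, John Jim", "[^,\\s]+"));
        // [Smith, John, Jim]
        System.out.println(tokenize("CCTTACTTATAATGCTCATGCTA", ".{4}"));
        // [CCTT, ACTT, ATAA, TGCT, CATG]
    }
}
```

Because the expression is matched repeatedly with `find()`, any text between matches is simply ignored rather than being treated as a separator.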

# Word-Based Tokenization

This tokenizer extracts tokens at word boundaries.

It is comparable to a regular expression tokenizer using the expression `\p{L}+`: a token consists only of letters, never of digits.

| Input           | Tokens                 |
|-----------------|------------------------|
| 1John2Jim-Smith | `John`, `Jim`, `Smith` |
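
As a rough sketch, the example above can be reproduced with the Unicode letter class `\p{L}` in Java's regex engine (again illustrative, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class WordTokenization {

    // \p{L}+ matches runs of Unicode letters; digits and punctuation act as boundaries.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1John2Jim-Smith")); // [John, Jim, Smith]
    }
}
```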