# Tokenizer Reference
Tokenizers are used by some matchers to split a single text into multiple parts, allowing matching on the resulting tokens instead of the whole text.
## No Tokenization
This tokenizer does not split the input text, but instead returns the whole text as the only token.
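As a rough sketch (the class and method names here are hypothetical, not this tool's actual API), a "no tokenization" tokenizer can be pictured as returning the whole text as a single-element token list:

```java
import java.util.List;

// Hypothetical sketch of a "no tokenization" tokenizer:
// the whole input text becomes the one and only token.
public class NoTokenizerSketch {
    static List<String> tokenize(String text) {
        return List.of(text);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Smith, John Jim")); // -> [Smith, John Jim]
    }
}
```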
## Regular Expression Tokenization
This tokenizer extracts tokens using the provided regular expression.
The regular expression must describe a token itself, not the text between tokens. Because of this, you can also define tokens that are not separated by any kind of delimiter (see the second example below).
### Example
- Token Regular Expression: `[^,\s]+`
  (one or more consecutive characters that are neither commas nor whitespace)

  | Input | Tokens |
  |---|---|
  | Smith, John Jim | `Smith`, `John`, `Jim` |
- Token Regular Expression: `.{4}`
  (any four consecutive characters)

  | Input | Tokens |
  |---|---|
  | CCTTACTTATAATGCTCATGCTA | `CCTT`, `ACTT`, `ATAA`, `TGCT`, `CATG` |
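
To make the behaviour concrete, here is a minimal Java sketch (an illustration, not this tool's own implementation) that extracts tokens by matching the token pattern rather than splitting on separators. Applied to the two examples it yields the tokens shown above; note that the trailing `CTA` of the second input is shorter than four characters and therefore does not become a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizerSketch {

    // Collects every non-overlapping match of the token pattern.
    static List<String> tokenize(String tokenRegex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(tokenRegex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // First example: tokens are runs of characters without commas or whitespace.
        System.out.println(tokenize("[^,\\s]+", "Smith, John Jim"));
        // -> [Smith, John, Jim]

        // Second example: fixed-length tokens of four characters each;
        // the trailing "CTA" is too short to match and is not returned.
        System.out.println(tokenize(".{4}", "CCTTACTTATAATGCTCATGCTA"));
        // -> [CCTT, ACTT, ATAA, TGCT, CATG]
    }
}
```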
## Word-Based Tokenization
This tokenizer extracts tokens at word boundaries.
It is comparable to a regular expression tokenizer using the expression `\p{L}+`. That means a token contains only letters; digits and other non-letter characters act as boundaries.

| Input | Tokens |
|---|---|
| 1John2Jim-Smith | `John`, `Jim`, `Smith` |
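
As a rough Java equivalent (again a sketch, not the tool's own code), matching `\p{L}+` reproduces the table above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordTokenizerSketch {

    // Extracts runs of letters; digits, punctuation and whitespace act as boundaries.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\p{L}+").matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1John2Jim-Smith")); // -> [John, Jim, Smith]
    }
}
```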