# Text Comparison Reference

Text comparisons define whether two texts match. They can be used from within matchers that support them.

# Distinct Texts

Distinct text will consider two texts matching if they are not identical.

Distinct text must not be used in a rule whose only matcher is configured to match using distinct texts, as this would create a very large number of matches (every record matches except the identical ones). Always use it in combination with a stricter matcher or a text comparison on other attributes.

# Examples

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| Tilores | Tilores | false |
| Tilores | tilores | true |
| Tilores | resoliT | true |
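
The rows above can be reproduced with a minimal sketch, assuming a plain case-sensitive string comparison. The function name `distinct_text_matches` is hypothetical and only serves as an illustration; it is not part of any Tilores API.

```python
def distinct_text_matches(text1: str, text2: str) -> bool:
    """Return True when the two texts are not identical (case-sensitive)."""
    return text1 != text2


# The three rows of the example table.
for text1, text2 in [("Tilores", "Tilores"), ("Tilores", "tilores"), ("Tilores", "resoliT")]:
    print(text1, text2, distinct_text_matches(text1, text2))
```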

# Dynamic Text Distance

The dynamic text distance is used to refine more generic approaches.

Consider the following example: comparing the two names Mia and Moe using any of the phonetic algorithms results in a match, because vowels are often ignored during phonetic matching. The Levenshtein distance of those two words is also only 2, which is a good value for longer texts but far too much for three-character names.

With the dynamic text distance, you can define a base comparison for finding potential matches (this can be any of the other text comparisons) and specify a table of maximum allowed distances for specific text lengths.

The available distance algorithms are the same as for the text distance.

# Example

The default configuration provides the following table:

| Minimum Length of Shortest Text | Maximum Allowed Distance |
| --- | --- |
| 0 | 0 |
| 4 | 1 |

This means that for texts with up to 3 characters, no changes are allowed (exact matching). Texts with at least 4 characters are allowed to have a distance of 1.

The previous example of Mia and Moe would no longer match. Steven and Stefan would not match either, despite sounding similar, because their Levenshtein distance is 2.

This table can be adjusted according to your needs. For example, adding another row with a minimum length of 6 and a maximum allowed distance of 2 would result in Steven and Stefan matching.
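
The table lookup described in this section can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: it uses the Levenshtein distance, skips the base comparison that finds potential matches, and all names (`levenshtein`, `max_allowed_distance`, `dynamic_distance_matches`, `DEFAULT_TABLE`, `EXTENDED_TABLE`) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def max_allowed_distance(text1: str, text2: str, table: list[tuple[int, int]]) -> int:
    """Pick the distance of the last row whose minimum length fits the shortest text."""
    shortest = min(len(text1), len(text2))
    allowed = 0
    for min_length, distance in sorted(table):
        if shortest >= min_length:
            allowed = distance
    return allowed


def dynamic_distance_matches(text1: str, text2: str, table: list[tuple[int, int]]) -> bool:
    """Match when the edit distance stays within the length-dependent limit."""
    return levenshtein(text1, text2) <= max_allowed_distance(text1, text2, table)


# (minimum length of shortest text, maximum allowed distance)
DEFAULT_TABLE = [(0, 0), (4, 1)]
EXTENDED_TABLE = [(0, 0), (4, 1), (6, 2)]  # assumed extra row: length 6, distance 2

print(dynamic_distance_matches("Mia", "Moe", DEFAULT_TABLE))         # False
print(dynamic_distance_matches("Steven", "Stefan", DEFAULT_TABLE))   # False
print(dynamic_distance_matches("Steven", "Stefan", EXTENDED_TABLE))  # True
```

Keeping the table as a list of rows makes it easy to add or remove length thresholds without touching the matching logic.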

# Exact Text

Exact text will consider two texts matching if they are identical.

# Examples

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| Tilores | Tilores | true |
| Tilores | tilores | false |
| Tilores | resoliT | false |

# Nickname Similarity

Nickname similarity can be used to determine if two first names are similar nicknames. This is done by looking up the provided name or nickname in a pre-defined alias list.

Because nicknames are often similar among a lot of different names (e.g. Chris might refer to Christian, Christine, Christopher and many more), this often results in potential mismatches and should only be used in combination with other stricter matchers.

One way to further reduce the mismatches is to ignore name matching for common names. This can either be done using the remove common name transformation beforehand or by enabling the corresponding option.

# Examples

| Text 1 | Text 2 | Ignore Common Names Enabled | Matches |
| --- | --- | --- | --- |
| Mike | Michael | false | true |
| Mike | Michael | true | false |
| Mike | Jim | false | false |
| Esther | Essie | true | true |
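
The alias lookup can be illustrated with a small sketch. The alias list, the set of common names, and the way the ignore option is applied are all assumptions made for this example; the real pre-defined list and its handling may differ. The names `ALIASES`, `COMMON_NAMES`, `canonical_names` and `nickname_matches` are hypothetical.

```python
# Hypothetical alias list: canonical first name -> known nicknames.
ALIASES = {
    "michael": {"mike", "mick", "mikey"},
    "james": {"jim", "jimmy"},
    "esther": {"essie"},
}

# Hypothetical set of names that are treated as "common".
COMMON_NAMES = {"michael", "james"}


def canonical_names(name: str) -> set[str]:
    """Return every canonical name that the given name or nickname may refer to."""
    lowered = name.lower()
    return {
        canonical
        for canonical, nicknames in ALIASES.items()
        if lowered == canonical or lowered in nicknames
    }


def nickname_matches(text1: str, text2: str, ignore_common_names: bool = False) -> bool:
    """Match when both names share a canonical name, optionally skipping common names."""
    shared = canonical_names(text1) & canonical_names(text2)
    if ignore_common_names:
        shared -= COMMON_NAMES
    return bool(shared)


print(nickname_matches("Mike", "Michael"))                              # True
print(nickname_matches("Mike", "Michael", ignore_common_names=True))    # False
print(nickname_matches("Esther", "Essie", ignore_common_names=True))    # True
```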

# Phonetic Similarity

Phonetic similarity can be used to determine if two texts sound similar. This is typically done by converting each text into a phonetic code using a phonetic algorithm and then comparing the resulting codes for equality.

The biggest advantage of phonetic similarity is its extremely low performance impact.

The following phonetic algorithms are currently supported.

# Cologne Phonetic

This phonetic uses the Kölner Phonetik Algorithm, which is similar to the Soundex phonetic algorithm, but specialized for the German language.

# Examples

| Text 1 | Text 2 | Phonetic Code 1 | Phonetic Code 2 | Matches |
| --- | --- | --- | --- | --- |
| Steven | Stefan | 8236 | 8236 | true |
| Steven | Hendrik | 8236 | 06274 | false |

# Metaphone

This phonetic uses the Metaphone Algorithm, which is similar to the Soundex phonetic algorithm, but with more accurate results.

# Examples

| Text 1 | Text 2 | Phonetic Code 1 | Phonetic Code 2 | Matches |
| --- | --- | --- | --- | --- |
| Steven | Stefan | STFN | STFN | true |
| Steven | Hendrik | STFN | NTRK | false |

# Soundex

This uses the Soundex Algorithm, a relatively simple way of producing the same output for similar-sounding words.

| Text 1 | Text 2 | Phonetic Code 1 | Phonetic Code 2 | Matches |
| --- | --- | --- | --- | --- |
| Steven | Stefan | S315 | S315 | true |
| Steven | Hendrik | S315 | H536 | false |
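
To show how such codes come about, here is a minimal sketch of the common American Soundex rules (keep the first letter, map consonants to digits, collapse repeated digits, pad to four characters). It is written for illustration only and is not necessarily the exact variant used by the product; the function names `soundex` and `phonetic_matches` are hypothetical.

```python
# Consonant groups and the digit each group is encoded as.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(text: str) -> str:
    """Encode a word as its first letter followed by three digits."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous_digit = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate consonants with the same digit
        digit = CODES.get(letter, "")  # vowels and other letters have no digit
        if digit and digit != previous_digit:
            code += digit
        previous_digit = digit
    return (code + "000")[:4]  # pad with zeros, then cut to four characters


def phonetic_matches(text1: str, text2: str) -> bool:
    """Two texts match when their phonetic codes are equal."""
    return soundex(text1) == soundex(text2)


print(soundex("Steven"), soundex("Stefan"), soundex("Hendrik"))  # S315 S315 H536
```

Because only the short codes have to be compared, the equality check itself is cheap, which fits the low performance impact mentioned for phonetic similarity.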

# Text Distance

Text distance can be used to determine if two texts are structurally similar. This is typically done by calculating an edit distance and then checking if the distance is below or equal to the maximum allowed distance.

Text distance must not be used in a rule with a single matcher that is configured to match using distinct texts as this would create a lot of potential matches due to the way text distance is indexed.

The following text distance algorithms are currently supported:

Some of these algorithms are also available as a text similarity, offering a length-independent score.

# Examples

For examples and playing with your own data, we recommend our fuzzy matching tool to see those algorithms in action.

# Text Similarity

Text similarity can be used to determine if two texts are structurally similar. This is typically done by calculating a similarity score using a similarity algorithm and then checking if the similarity score exceeds a configured threshold.

The threshold must be a numeric value between 0.0 and 1.0. A value of 0.0 means that all texts are matching, and a value of 1.0 means that only identical texts are matching. For most use cases a value between 0.7 and 0.9 should be good.

The shingle size is relevant for pre-filtering potential candidates. It defines the k-grams used for indexing the data. Depending on your text size we recommend a value of 4 or 5. A low value of e.g. 2 might have a negative performance impact, while a very high value might miss potential matches. Algorithms that use shingles for calculating the similarity score, e.g. Cosine Similarity, will only use this for finding potential matches but use a shingle size of 2 internally for the final score.
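
The interplay of shingle size and threshold can be sketched as follows. This is an illustration under assumptions, not the actual indexing or scoring code: candidates are assumed to be texts that share at least one k-gram, the score is a cosine similarity over character 2-grams, and the threshold check uses "greater than or equal" so that a threshold of 1.0 still accepts identical texts. All function names are hypothetical.

```python
import math
from collections import Counter


def shingles(text: str, size: int) -> Counter:
    """Count the overlapping character k-grams of a text."""
    if len(text) < size:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + size] for i in range(len(text) - size + 1))


def is_candidate(text1: str, text2: str, shingle_size: int) -> bool:
    """Assumed pre-filter: the texts share at least one k-gram of the given size."""
    return bool(shingles(text1, shingle_size).keys() & shingles(text2, shingle_size).keys())


def cosine_similarity(text1: str, text2: str, size: int = 2) -> float:
    """Cosine similarity of the k-gram count vectors (0.0 to 1.0)."""
    a, b = shingles(text1, size), shingles(text2, size)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def similarity_matches(text1: str, text2: str, threshold: float = 0.8) -> bool:
    """Match when the similarity score reaches the configured threshold."""
    return cosine_similarity(text1, text2) >= threshold


# A larger shingle size can miss candidates that a smaller one still finds.
print(is_candidate("Steven", "Stefan", 3))  # True  (shared 3-gram "Ste")
print(is_candidate("Steven", "Stefan", 4))  # False (no shared 4-gram)
print(similarity_matches("Tilores", "Tilores", threshold=1.0))  # True
```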

Text similarity is often less performant than phonetic similarity but mostly yields better results.

The following text similarity algorithms are currently supported:

Some of these algorithms are also available as a text distance, offering precise control over the maximum distance.

# Examples

For examples and playing with your own data, we recommend our fuzzy matching tool to see those algorithms in action.