#
Text Comparison Reference
Text comparisons are used to determine whether two texts match. They can be used within matchers that support them.
#
Distinct Texts
Distinct text will consider two texts matching if they are not identical.
Distinct text must not be used as the only comparison in a rule with a single matcher, as this would create a very large number of matches (every pair of records matches except the identical ones). Always use it in combination with a stricter matcher or text comparison on other attributes.
#
Examples
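The following is a minimal sketch of the idea, not the product's implementation. The helper names `distinct_text` and `rule_matches`, the record layout, and the additional condition on `birth_date` are assumptions chosen to illustrate why distinct text should be paired with a stricter comparison.

```python
def distinct_text(a: str, b: str) -> bool:
    """Hypothetical helper: two texts 'match' when they are not identical."""
    return a != b


def rule_matches(rec_a: dict, rec_b: dict) -> bool:
    """Illustrative rule: same birth date (strict) but a different last name."""
    same_birth_date = rec_a["birth_date"] == rec_b["birth_date"]
    return same_birth_date and distinct_text(rec_a["last_name"], rec_b["last_name"])


a = {"last_name": "Miller", "birth_date": "1990-04-01"}
b = {"last_name": "Smith", "birth_date": "1990-04-01"}
c = {"last_name": "Miller", "birth_date": "1985-12-24"}

print(rule_matches(a, b))  # True: same birth date, different last names
print(rule_matches(a, c))  # False: the stricter condition fails
```

Without the birth date condition, nearly every pair of records would match, which is the situation the warning above describes.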
#
Dynamic Text Distance
The dynamic text distance is used to refine more generic approaches.
Consider the following example: comparing the two names Mia and Moe using any of the phonetic algorithms results in a match, because vowels are often ignored during phonetic matching. In addition, the Levenshtein distance between those two words is only 2, which is a good value for longer texts but far too much for three-character names.
With the dynamic text distance, you can define a base comparison for finding potential matches (this can be any of the other text comparisons) and specify a table of maximum allowed distances for specific text lengths.
The available distance algorithms are the same as for the text distance comparison.
#
Example
The default configuration provides a table with the following thresholds: texts with up to 3 characters allow no changes (exact matching), and texts with at least 4 characters allow a distance of 1.
The previous example of Mia and Moe would no longer match. Steven and Stefan would not match either, despite sounding similar, because their Levenshtein distance is 2.
This table can be adjusted according to your needs. For example, adding another column with a text length of 6 and a maximum distance of 2 would cause Steven and Stefan to match.
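The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the table layout (pairs of minimum text length and maximum allowed distance), the names `DEFAULT_TABLE`, `EXTENDED_TABLE`, `max_distance` and `dynamic_match`, and the choice to let the shorter text select the table entry are all hypothetical. The base comparison that finds potential matches is omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Assumed layout: (minimum text length, maximum allowed distance).
DEFAULT_TABLE = [(0, 0), (4, 1)]
EXTENDED_TABLE = [(0, 0), (4, 1), (6, 2)]


def max_distance(length: int, table) -> int:
    """Return the distance of the last entry whose minimum length is reached."""
    allowed = 0
    for min_length, distance in sorted(table):
        if length >= min_length:
            allowed = distance
    return allowed


def dynamic_match(a: str, b: str, table) -> bool:
    # Assumption: the shorter of the two texts decides which entry applies.
    length = min(len(a), len(b))
    return levenshtein(a, b) <= max_distance(length, table)


print(dynamic_match("Mia", "Moe", DEFAULT_TABLE))         # False: distance 2, allowed 0
print(dynamic_match("Steven", "Stefan", DEFAULT_TABLE))   # False: distance 2, allowed 1
print(dynamic_match("Steven", "Stefan", EXTENDED_TABLE))  # True: distance 2, allowed 2
```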
#
Exact Text
Exact text will consider two texts matching if they are identical.
#
Examples
#
Nickname Similarity
Nickname similarity can be used to determine if two first names are similar nicknames. This is done by looking up the provided name or nickname in a pre-defined alias list.
Because nicknames are often similar among a lot of different names (e.g. Chris might refer to Christian, Christine, Christopher and many more), this often results in potential mismatches and should only be used in combination with other stricter matchers.
One way to further reduce the mismatches is to ignore name matching for common names. This can either be done using the remove common name transformation beforehand or by enabling the corresponding option.
#
Examples
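As a rough illustration of an alias lookup, the sketch below uses a tiny hand-written alias list. The contents of `ALIASES`, the function names, and the `common_names` parameter are hypothetical; the actual pre-defined alias list and the behavior of the common name option may differ.

```python
# Hypothetical alias list: canonical first name -> known nicknames.
ALIASES = {
    "christian": {"chris"},
    "christine": {"chris", "tina"},
    "christopher": {"chris", "topher"},
    "william": {"bill", "will"},
}


def name_group(name: str) -> set:
    """All canonical names that the given name or nickname may refer to."""
    key = name.strip().lower()
    group = {canonical for canonical, nicks in ALIASES.items() if key in nicks}
    if key in ALIASES:
        group.add(key)
    return group


def nickname_similar(a: str, b: str, common_names=frozenset()) -> bool:
    """Two names are similar if they can refer to the same canonical name."""
    # Assumed option: names listed as common are never matched.
    if a.strip().lower() in common_names or b.strip().lower() in common_names:
        return False
    return bool(name_group(a) & name_group(b))


print(nickname_similar("Bill", "William"))        # True
print(nickname_similar("Chris", "Christopher"))   # True
print(nickname_similar("Chris", "Tina"))          # True: a potential mismatch
print(nickname_similar("Christian", "Christine")) # False
print(nickname_similar("Chris", "Christopher", common_names={"chris"}))  # False
```

The third call shows the kind of mismatch described above: both nicknames can refer to Christine, so they are reported as similar even though Chris may just as well stand for Christopher.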
#
Phonetic Similarity
Phonetic similarity can be used to determine if two texts sound similar. This is typically done by converting each text into a phonetic code using a phonetic algorithm and then comparing the resulting codes for equality.
The biggest advantage of phonetic similarity is its extremely low performance impact.
The following phonetic algorithms are currently supported.
#
Cologne Phonetic
This comparison uses the Kölner Phonetik (Cologne phonetics) algorithm, which is similar to the Soundex phonetic algorithm but specialized for the German language.
#
Examples
#
Metaphone
This comparison uses the Metaphone algorithm, which is similar to the Soundex phonetic algorithm but generally produces more accurate results.
#
Examples
#
Soundex
This comparison uses the Soundex algorithm, a relatively simple way of producing the same code for similar-sounding words.
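To make the idea concrete, here is a compact sketch of the commonly described American Soundex rules (first letter kept, consonants mapped to digits, vowels dropped, code padded to four characters). It is written for illustration only; the function name `soundex` and its handling of non-letters are assumptions, and the product may use a different variant.

```python
# Digit codes for consonants; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return a four-character Soundex code, or '' if no A-Z letters are present."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last_code = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets the code, so repeats are kept
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Mia"), soundex("Moe"))        # M000 M000
```

The second line shows why short names such as Mia and Moe are considered a phonetic match: all vowels are dropped, so only the first letter contributes to the code.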
#
Text Distance
Text distance can be used to determine if two texts are structurally similar. This is typically done by calculating an edit distance and then checking if the distance is less than or equal to the maximum allowed distance.
Text distance must not be used as the only comparison in a rule with a single matcher, as this would create a lot of potential matches due to the way text distance is indexed.
The following text distance algorithms are currently supported:
- Damerau-Levenshtein Distance with Adjacent Transpositions
- Damerau-Levenshtein with Optimal String Alignment Distance
- Hamming Distance
- LCS (Longest Common Subsequence)
- Levenshtein Distance
Some of these algorithms are also available as a text similarity, which offers a length-independent score.
#
Examples
To see these algorithms in action and experiment with your own data, we recommend our fuzzy matching tool.
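As a small offline illustration, the sketch below implements two of the listed distances, optimal string alignment and Hamming, and checks them against a maximum allowed distance. The function names are hypothetical, and raising an error for texts of unequal length in `hamming_distance` is an assumption of this sketch, not documented product behavior.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equally long texts differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires texts of equal length")
    return sum(x != y for x, y in zip(a, b))


def within_distance(distance: int, max_distance: int) -> bool:
    """A pair matches when its distance does not exceed the maximum."""
    return distance <= max_distance


print(osa_distance("Michael", "Micheal"))      # 1: one swapped pair of letters
print(hamming_distance("Michael", "Micheal"))  # 2: two positions differ
print(within_distance(osa_distance("Steven", "Stefan"), 1))  # False: distance is 2
```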
#
Text Similarity
Text similarity can be used to determine if two texts are structurally similar. This is typically done by calculating a similarity score using a similarity algorithm and then checking if the similarity score exceeds a configured threshold.
The threshold must be a numeric value between 0.0 and 1.0. A value of 0.0 means that all texts match, and a value of 1.0 means that only identical texts match. For most use cases, a value between 0.7 and 0.9 works well.
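The effect of the threshold can be illustrated with a length-normalized Levenshtein score. The normalization used here (one minus the distance divided by the longer text length), the `>=` comparison, and the function names are assumptions of this sketch; the product's scoring may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical texts, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_match(a: str, b: str, threshold: float) -> bool:
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return levenshtein_similarity(a, b) >= threshold


print(is_match("Jonathan", "Jonathon", 0.8))  # True: score is 0.875
print(is_match("Steven", "Stefan", 0.8))      # False: score is about 0.67
print(is_match("Mia", "Moe", 0.0))            # True: everything matches at 0.0
print(is_match("Mia", "Moe", 1.0))            # False: only identical texts match
```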
The shingle size is relevant for pre-filtering potential candidates. It defines the k-grams used for indexing the data. Depending on your text size we recommend a value of 4 or 5. A low value of e.g. 2 might have a negative performance impact, while a very high value might miss potential matches. Algorithms that use shingles for calculating the similarity score, e.g. Cosine Similarity, will only use this for finding potential matches but use a shingle size of 2 internally for the final score.
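The trade-off between small and large shingle sizes can be seen in a toy index. This is only a sketch of k-gram pre-filtering in general; the function names, the lowercasing, and the treatment of texts shorter than k are assumptions and say nothing about how the product's index is built.

```python
from collections import defaultdict


def shingles(text: str, k: int) -> set:
    """Set of overlapping k-grams; a text shorter than k yields itself (assumption)."""
    text = text.lower()
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def build_index(texts, k: int):
    """Map every k-gram to the positions of the texts that contain it."""
    index = defaultdict(set)
    for position, text in enumerate(texts):
        for gram in shingles(text, k):
            index[gram].add(position)
    return index


def candidates(query: str, texts, index, k: int):
    """Texts sharing at least one k-gram with the query."""
    found = set()
    for gram in shingles(query, k):
        found |= index.get(gram, set())
    return sorted(texts[i] for i in found)


names = ["Stefan", "Steven", "Stephanie", "Moe"]

print(candidates("Stefan", names, build_index(names, 2), 2))
# ['Stefan', 'Stephanie', 'Steven']: many candidates to score
print(candidates("Stefan", names, build_index(names, 4), 4))
# ['Stefan']: Steven shares no 4-gram and is missed
```

A small shingle size produces more candidates that must be scored afterwards, while a large one can filter out texts that a similarity algorithm would still have accepted.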
Text similarity is usually slower than phonetic similarity but generally yields better results.
The following text similarity algorithms are currently supported:
- Cosine Similarity
- Damerau-Levenshtein Distance with Adjacent Transpositions
- Damerau-Levenshtein with Optimal String Alignment Distance
- Hamming Distance
- Jaccard
- Jaro Similarity
- Jaro-Winkler Similarity
- LCS (Longest Common Subsequence)
- Levenshtein Distance
- Sørensen–Dice Coefficient
- q-gram Similarity
- Fuzzy Wuzzy
Some of these algorithms are also available as a text distance, which offers precise control over the maximum distance.
#
Examples
To see these algorithms in action and experiment with your own data, we recommend our fuzzy matching tool.
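For a quick offline impression of how shingle-based scores behave, the sketch below computes the Jaccard and Sørensen–Dice scores on sets of character bigrams. The function names and the use of plain sets (ignoring repeated bigrams) are assumptions of this sketch; the product's implementations may weight or tokenize texts differently.

```python
def bigrams(text: str) -> set:
    """Set of overlapping two-character shingles of the lowercased text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Shared bigrams divided by all distinct bigrams of both texts."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def sorensen_dice(a: str, b: str) -> float:
    """Twice the shared bigrams divided by the sum of both bigram counts."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


print(jaccard("Steven", "Stefan"))        # 0.25: 2 shared of 8 distinct bigrams
print(sorensen_dice("Steven", "Stefan"))  # 0.4: 2 * 2 / (5 + 5)
print(jaccard("Steven", "Steven"))        # 1.0
```

Both scores stay well below the typical thresholds for this pair, which shows that different algorithms can rate the same two texts quite differently.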