# Matcher Reference

Matchers are used by rules to specify how to match a single attribute. In its simplest form a matcher performs text equality matching on a single field.

Each matcher needs at least one transformation output.

# Advanced Text Matching

The advanced text matcher uses tokenization and text comparisons to check whether the provided transformation output has a similar value for two records. This is the right choice when you need fuzzy matching for your texts.

In general, the advanced text matcher splits the transformation output into tokens using the selected tokenization and then compares the tokens using the selected token strategy. Each token pair is compared using the selected text comparison.

When no tokenization is chosen, it is not possible to choose a token strategy; instead, the whole text is compared using the selected text comparison.
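
The interplay of tokenization, token strategy, and text comparison can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names, the regular-expression word tokenizer, the `difflib`-based similarity with its 0.8 threshold, and the toy "any token pair" strategy are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List, Optional

Compare = Callable[[str, str], bool]
Strategy = Callable[[List[str], List[str], Compare], bool]


def word_tokens(text: str) -> List[str]:
    # Hypothetical word tokenizer: runs of letters/digits form a token,
    # so "Smith-Doe" becomes ["Smith", "Doe"].
    return re.findall(r"\w+", text)


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Stand-in fuzzy text comparison (an assumption for this sketch).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def any_token_pair(tokens1: List[str], tokens2: List[str], compare: Compare) -> bool:
    # Toy token strategy: satisfied as soon as one token pair compares as equal.
    return any(compare(a, b) for a in tokens1 for b in tokens2)


def advanced_text_match(
    text1: str,
    text2: str,
    compare: Compare,
    tokenize: Optional[Callable[[str], List[str]]] = None,
    strategy: Optional[Strategy] = None,
) -> bool:
    if tokenize is None or strategy is None:
        # No tokenization: no token strategy applies, the whole text is compared.
        return compare(text1, text2)
    return strategy(tokenize(text1), tokenize(text2), compare)
```

Without a tokenizer, `advanced_text_match("John Smith", "Smith John", lambda a, b: a == b)` compares the full strings and fails; with `tokenize=word_tokens` and `strategy=any_token_pair`, the same call succeeds because at least one token pair is equal.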

A special option is the generally available "index token combinations" option: it can be used to improve the performance in certain scenarios by ensuring that for each record a minimum number of tokens will be indexed together. This may reduce the number of potential matches, but also means that the text must contain at least that many tokens. It is advisable to use this option exclusively when there is a clear understanding of its performance implications.
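
One way to picture this option: if every index key is built from a fixed number of tokens, a text with fewer tokens produces no key at all and therefore cannot be found through that index. The sketch below illustrates the idea with `itertools.combinations`. How the index is actually built is not described here, so the function name and the key layout are assumptions.

```python
from itertools import combinations
from typing import List, Tuple


def index_keys(tokens: List[str], combination_size: int) -> List[Tuple[str, ...]]:
    # Sort and de-duplicate so that token order does not affect the keys.
    unique = sorted(set(tokens))
    # Every key holds exactly `combination_size` tokens; too few tokens
    # yield an empty list.
    return list(combinations(unique, combination_size))
```

In this picture, two records are only candidates for each other if they share a key. With a combination size of 2, "John Smith" and "John Doe" share no key, which reduces the number of potential matches, while a single-token text such as "John" yields no key at all.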

The following token strategies are available. All given examples use word based tokenization and exact matching for simplicity.

# Minimum Matches

Minimum matches compares all tokens from text 1 with all tokens from text 2 and is satisfied if at least the minimum number of token matches is reached. The order of the tokens does not matter.

# Examples

  • Minimum Token Matches: 2

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John | John | false |
| John Smith | John Smith | true |
| John Smith | Smith John | true |
| John Smith | John Doe | false |
| John Smith | John Smith-Doe | true |
| John Smith | Smith John Doe | true |
| John John | John Smith | true |
| John Smith-Doe | John Smith-Doe | true |
| John Smith Doe | Smith John Doe | true |
| John John Smith | John Smith-Doe | true |
| Jane Smith-Doe | John Smith-Doe | true |

  • Minimum Token Matches: 3

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John | John | false |
| John Smith | John Smith | false |
| John Smith | Smith John | false |
| John Smith | John Doe | false |
| John Smith | John Smith-Doe | false |
| John Smith | Smith John Doe | false |
| John John | John Smith | false |
| John Smith-Doe | John Smith-Doe | true |
| John Smith Doe | Smith John Doe | true |
| John John Smith | John Smith-Doe | true |
| Jane Smith-Doe | John Smith-Doe | false |
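
The examples above can be reproduced with a short sketch. It assumes a word tokenizer that splits on non-alphanumeric characters (so "Smith-Doe" yields two tokens) and counts every equal token pair as one match. Both the counting rule and the function names are assumptions inferred from the example rows, not a description of the actual implementation.

```python
import re
from typing import List


def word_tokens(text: str) -> List[str]:
    # Assumed word-based tokenization: "Smith-Doe" -> ["Smith", "Doe"].
    return re.findall(r"\w+", text)


def minimum_matches(text1: str, text2: str, minimum: int) -> bool:
    tokens1, tokens2 = word_tokens(text1), word_tokens(text2)
    # Compare all tokens of text 1 with all tokens of text 2 and count
    # each equal pair; token order plays no role.
    matches = sum(1 for a in tokens1 for b in tokens2 if a == b)
    return matches >= minimum
```

Under this counting rule, "John John" and "John Smith" reach two matches because both occurrences of "John" in text 1 pair with the single "John" in text 2.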

# Same Token Order

Same token order expects both texts to have the same number of tokens and then compares the first token from text 1 with the first from text 2, the second from text 1 with the second from text 2, and so on. Hence, the order of the tokens matters, but there is no minimum number of tokens required.

# Examples

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John | John | true |
| John Smith | John Smith | true |
| John Smith | Smith John | false |
| John Smith | John Doe | false |
| John Smith | John Smith-Doe | false |
| John Smith | Smith John Doe | false |
| John John | John Smith | false |
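
A position-by-position comparison of this kind might look like the following sketch. The word tokenizer and the function names are assumptions for illustration, and exact equality stands in for the selected text comparison.

```python
import re
from typing import List


def word_tokens(text: str) -> List[str]:
    # Assumed word-based tokenization: "Smith-Doe" -> ["Smith", "Doe"].
    return re.findall(r"\w+", text)


def same_token_order(text1: str, text2: str) -> bool:
    tokens1, tokens2 = word_tokens(text1), word_tokens(text2)
    # Both texts must have the same number of tokens.
    if len(tokens1) != len(tokens2):
        return False
    # Compare first with first, second with second, and so on.
    return all(a == b for a, b in zip(tokens1, tokens2))
```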

# Token Overlap

The token overlap strategy matches if there is a certain overlap between the tokens of both texts. It compares all tokens from text 1 with all tokens from text 2 and determines the combination with the fewest additional (unmatched) tokens. It is satisfied if the number of additional tokens is equal to or lower than the configured maximum additional tokens.

This strategy works in two different modes. Either one of the two texts must be fully included in the other text (one of the texts must have no additional tokens) or both texts may have additional tokens, but share at least one token.

Optionally, you can remove duplicate tokens before processing by enabling the corresponding checkbox.

Due to the complexity of this strategy, it only works for a maximum of 8 tokens (default: 6).

# Examples

  • Mode: One input must be completely included in the other one
  • Maximum Additional Tokens: 2
  • Maximum Processable Tokens: 5
  • Remove Duplicate Tokens: disabled

| Text 1 | Text 2 | Additional Tokens | Matches |
| --- | --- | --- | --- |
| John | John | 0 | true |
| John Smith | John Smith | 0 | true |
| John Smith | Smith John | 0 | true |
| John Smith | John Doe | 2 | false |
| John Smith | John Smith-Doe | 1 | true |
| John Smith | Smith John Doe | 1 | true |
| John John | John Smith | 2 | false |
| John Smith | John Jim Smith-Doe | 2 | true |
| John Smith | John Jim Jane Smith-Doe | 3 | false |
| John Jim Jane Julia Smith-Doe | John Jim Jane Julia Smith-Doe | n/a* | false |

  • Mode: One input must be completely included in the other one
  • Maximum Additional Tokens: 2
  • Maximum Processable Tokens: 5
  • Remove Duplicate Tokens: enabled

| Text 1 | Text 2 | Additional Tokens | Matches |
| --- | --- | --- | --- |
| John John | John Smith | 1 | true |

  • Mode: At least one token between the inputs must be similar
  • Maximum Additional Tokens: 2
  • Maximum Processable Tokens: 5
  • Remove Duplicate Tokens: disabled

| Text 1 | Text 2 | Additional Tokens | Matches |
| --- | --- | --- | --- |
| John | John | 0 | true |
| John Smith | John Smith | 0 | true |
| John Smith | Smith John | 0 | true |
| John Smith | John Doe | 2 | true |
| John Smith | John Smith-Doe | 1 | true |
| John Smith | Smith John Doe | 1 | true |
| John John | John Smith | 2 | true |
| John Smith | John Jim Smith-Doe | 2 | true |
| John Smith | John Jim Jane Smith-Doe | 3 | false |
| John Jim Jane Julia Smith-Doe | John Jim Jane Julia Smith-Doe | n/a* | false |

(*) not matched due to too many tokens
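
For exact token matching, the rows above can be reproduced with a multiset intersection, as in the sketch below. Several points are assumptions inferred from the examples: the token limit applies to each text separately and is checked after duplicate removal, duplicates are removed within each text, and the parameter names are made up for illustration. With a fuzzy text comparison the best pairing is no longer a simple intersection, which this sketch does not attempt to cover.

```python
import re
from collections import Counter
from typing import List


def word_tokens(text: str) -> List[str]:
    # Assumed word-based tokenization: "Smith-Doe" -> ["Smith", "Doe"].
    return re.findall(r"\w+", text)


def token_overlap(
    text1: str,
    text2: str,
    *,
    max_additional: int,
    max_tokens: int = 6,
    must_include: bool = True,
    remove_duplicates: bool = False,
) -> bool:
    tokens1, tokens2 = word_tokens(text1), word_tokens(text2)
    if remove_duplicates:
        # Drop repeated tokens within each text, keeping the first occurrence.
        tokens1 = list(dict.fromkeys(tokens1))
        tokens2 = list(dict.fromkeys(tokens2))
    if len(tokens1) > max_tokens or len(tokens2) > max_tokens:
        return False  # too many tokens to process
    # Each token can be paired at most once, hence the multiset intersection.
    shared = sum((Counter(tokens1) & Counter(tokens2)).values())
    additional = len(tokens1) + len(tokens2) - 2 * shared
    if additional > max_additional:
        return False
    if must_include:
        # One text must be fully contained in the other.
        return shared == min(len(tokens1), len(tokens2))
    # Otherwise both may have additional tokens but must share at least one.
    return shared >= 1
```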

# Token Ratio

Token ratio compares all tokens from text 1 with all tokens from text 2 and calculates a ratio of matching tokens. If the ratio is equal to or higher than the configured ratio, then the matcher is satisfied.

The ratio is calculated as $r = \frac{t_m}{t_u}$, where $t_m$ is the number of matching tokens and $t_u$ is the total number of unique tokens.

# Examples

| Text 1 | Text 2 | Ratio |
| --- | --- | --- |
| John | John | 1.0 |
| John Smith | John Smith | 1.0 |
| John Smith | Smith John | 1.0 |
| John Smith | John Doe | $0.\overline{3}$ |
| John Smith | John Smith-Doe | $0.\overline{6}$ |
| John Smith | Smith John Doe | $0.\overline{6}$ |
| John John | John Smith | 0.5 |
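
With exact matching, the ratio reduces to a set computation: the matching tokens are those found in both texts, and the unique tokens are those found in either. The sketch below assumes word-based tokenization and uses hypothetical function names. It does not cover other text comparisons.

```python
import re
from typing import List


def word_tokens(text: str) -> List[str]:
    # Assumed word-based tokenization: "Smith-Doe" -> ["Smith", "Doe"].
    return re.findall(r"\w+", text)


def token_ratio(text1: str, text2: str) -> float:
    unique1, unique2 = set(word_tokens(text1)), set(word_tokens(text2))
    unique = unique1 | unique2          # t_u: all unique tokens
    if not unique:
        return 0.0                      # no tokens at all (sketch's own choice)
    matching = unique1 & unique2        # t_m: tokens present in both texts
    return len(matching) / len(unique)


def token_ratio_match(text1: str, text2: str, minimum_ratio: float) -> bool:
    # Satisfied if the ratio is equal to or higher than the configured ratio.
    return token_ratio(text1, text2) >= minimum_ratio
```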

# Example with Other Text Comparison

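Be aware that the required uniqueness of tokens might change the results when working with non-equal text comparisons. In the example below, John and Johnn match phonetically but still count as two distinct unique tokens, so the ratio is 2/3 rather than 1.0.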

  • Compare Tokens Using: Phonetic (Soundex)

| Text 1 | Text 2 | Ratio |
| --- | --- | --- |
| John Smith | John Smith | 1.0 |
| John Smith | Johnn Smith | $0.\overline{6}$ |

# Geographical Distance

The geographical distance compares geographical coordinates from two records with each other and is satisfied if they are within a given distance. The coordinates are specified by providing a latitude and a longitude using signed decimal degrees without compass direction, e.g. latitude 53.158953 and longitude 12.793203 instead of 53° 09′ 32″ N, 012° 47′ 36″ E.

The distance must be provided in kilometers. For example, use 0.1 for a distance of 100 m.

If either the latitude or longitude value is missing or empty, the matcher will not be satisfied. The valid value 0.0 is not considered empty, but an empty string is.

The optional initial distance can be ignored in most cases. When indexing records, Tilores optimizes how the data is stored for faster searching. This optimization is based, among other things, on the provided distance. Changing the distance after records have been indexed can therefore lead to fewer matches than expected, or even none. If you still want to change the distance afterwards, provide the original distance as the initial distance and then set the distance to the new value. Be aware, however, that this can reduce performance when the distance is higher than the initial distance.

# Examples

  • Distance: 0.5

| Latitude 1 | Longitude 1 | Latitude 2 | Longitude 2 | Distance | Matches |
| --- | --- | --- | --- | --- | --- |
| 52.516340 | 13.377709 | 52.518753 | 13.376249 | ~0.27 km | true |
| 52.516340 | 13.377709 | 52.514551 | 13.350095 | ~1.89 km | false |
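
A distance check on signed decimal degrees can be sketched with the haversine formula. Which formula and Earth radius the matcher actually uses is not stated here, so both are assumptions, as are the function names. The computed values are approximations and need not agree to the last digit with the distances listed above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption for this sketch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    # Great-circle distance between two points given in decimal degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geo_match(lat1, lon1, lat2, lon2, max_km: float) -> bool:
    values = (lat1, lon1, lat2, lon2)
    # Missing or empty coordinates never match; 0.0 is a valid value.
    if any(v is None or v == "" for v in values):
        return False
    return haversine_km(*(float(v) for v in values)) <= max_km
```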

# Simple Text Equality

The simple text equality matcher is satisfied if the provided transformation output has the exact same text value for two records.

# Examples

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John Smith | John Smith | true |
| John Smith | Smith John | false |

# Temporal Distance

The temporal distance matches if two timestamps are within the given time frame.

The temporal distance option is a text of decimal numbers, each with a unit suffix such as "24h" or "2h30m15s". Valid units are "h" (hour), "m" (minute) and "s" (second).

The value for the transformation output must be a valid timestamp in the RFC3339Nano format. Other time formats might be supported, but there is no guarantee.

# Examples

  • Temporal Distance: 24h

| Time 1 | Time 2 | Matches |
| --- | --- | --- |
| 2023-06-07T11:18:32Z | 2023-06-07T12:00:00Z | true |
| 2023-06-07T11:18:32Z | 2023-06-08T11:18:33Z | false |
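
The following sketch parses a temporal distance such as "2h30m15s" and checks whether two timestamps lie within it. The function names are hypothetical, and treating a difference of exactly the configured distance as a match is an assumption. Python's `datetime` only keeps microsecond precision, so this sketch does not handle full nanosecond timestamps.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_distance(text: str) -> timedelta:
    # Decimal numbers, each followed by a unit suffix: "24h", "2h30m15s".
    if not re.fullmatch(r"(?:\d+(?:\.\d+)?[hms])+", text):
        raise ValueError(f"invalid temporal distance: {text!r}")
    total = timedelta()
    for number, unit in re.findall(r"(\d+(?:\.\d+)?)([hms])", text):
        total += timedelta(**{_UNITS[unit]: float(number)})
    return total


def parse_timestamp(text: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def temporal_match(time1: str, time2: str, distance: str) -> bool:
    delta = abs(parse_timestamp(time1) - parse_timestamp(time2))
    return delta <= parse_distance(distance)
```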