# Matcher Reference

Matchers are used by rules to specify how to match a single attribute. In its simplest form a matcher performs text equality matching on a single field.

Each matcher needs at least one transformation output.

# Advanced Text Matching

The advanced text matcher uses tokenization and text comparisons to check whether the provided transformation output has a similar value for two records. This is the right choice when you need fuzzy matching for text values.

In general, the advanced text matcher splits the transformation output into tokens using the selected tokenization and then compares the tokens using the selected token strategy. Each token pair is compared using the selected text comparison.

If no tokenization is selected, no token strategy can be chosen; instead, the whole text is compared using the selected text comparison.

A special option is the generally available "index token combinations" option. It can improve performance in certain scenarios by ensuring that, for each record, a minimum number of tokens is indexed together. This may reduce the number of potential matches, but it also means that the text must contain at least that many tokens. Use this option only when you clearly understand its performance implications.

The following token strategies are available. All given examples use word based tokenization and exact matching for simplicity.
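
The examples below treat a hyphenated name such as "Smith-Doe" as two tokens. A minimal sketch of a word-based tokenizer that behaves this way is shown here; the function name `word_tokens` and the exact splitting rule (runs of letters, digits and underscores) are assumptions made for illustration, not the product's actual tokenizer.

```python
import re


def word_tokens(text: str) -> list[str]:
    """Split text into word tokens.

    Assumption: a token is any run of letters, digits or underscores,
    so punctuation such as "-" acts as a separator.
    """
    return re.findall(r"\w+", text)


tokens = word_tokens("John Smith-Doe")  # ["John", "Smith", "Doe"]
```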

# Minimum Matches

Minimum matches compares all tokens from text 1 with all tokens from text 2 and is satisfied if at least the minimum number of token matches is reached. The order of the tokens does not matter.

# Examples

  • Minimum Token Matches: 2

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John | John | false |
| John Smith | John Smith | true |
| John Smith | Smith John | true |
| John Smith | John Doe | false |
| John Smith | John Smith-Doe | true |
| John Smith | Smith John Doe | true |
| John John | John Smith | true |
| John Smith-Doe | John Smith-Doe | true |
| John Smith Doe | Smith John Doe | true |
| John John Smith | John Smith-Doe | true |
| Jane Smith-Doe | John Smith-Doe | true |

  • Minimum Token Matches: 3

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John | John | false |
| John Smith | John Smith | false |
| John Smith | Smith John | false |
| John Smith | John Doe | false |
| John Smith | John Smith-Doe | false |
| John Smith | Smith John Doe | false |
| John John | John Smith | false |
| John Smith-Doe | John Smith-Doe | true |
| John Smith Doe | Smith John Doe | true |
| John John Smith | John Smith-Doe | true |
| Jane Smith-Doe | John Smith-Doe | false |
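
The following sketch shows one counting rule that reproduces the results above: every pair of equal tokens (one from each text) counts as a match. This pairwise rule, the function names and the tokenizer are assumptions inferred from the examples; the actual implementation may count matches differently.

```python
import re


def word_tokens(text: str) -> list[str]:
    # Assumed word-based tokenization: "Smith-Doe" becomes two tokens.
    return re.findall(r"\w+", text)


def minimum_matches(text1: str, text2: str, minimum: int) -> bool:
    """Return True if at least `minimum` token pairs are equal.

    Token order is ignored; exact equality stands in for the
    configurable text comparison.
    """
    tokens1, tokens2 = word_tokens(text1), word_tokens(text2)
    # Count every (token from text 1, token from text 2) pair that is equal.
    matches = sum(a == b for a in tokens1 for b in tokens2)
    return matches >= minimum
```

For example, `minimum_matches("John John", "John Smith", 2)` is satisfied under this rule because both "John" tokens of text 1 pair with the "John" in text 2.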

# Same Token Order

Same token order expects both texts to have the same number of tokens and then compares the first token from text 1 with the first from text 2, the second from text 1 with the second from text 2, and so on. Hence, the order of the tokens matters, but no minimum number of tokens is required.

# Examples

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John | John | true |
| John Smith | John Smith | true |
| John Smith | Smith John | false |
| John Smith | John Doe | false |
| John Smith | John Smith-Doe | false |
| John Smith | Smith John Doe | false |
| John John | John Smith | false |
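
A minimal sketch of the position-by-position comparison described above, assuming word-based tokenization and exact equality. The function names are illustrative only.

```python
import re


def word_tokens(text: str) -> list[str]:
    # Assumed word-based tokenization: "Smith-Doe" becomes two tokens.
    return re.findall(r"\w+", text)


def same_token_order(text1: str, text2: str) -> bool:
    """Return True if both texts have equally many tokens and the
    tokens at each position are equal."""
    tokens1, tokens2 = word_tokens(text1), word_tokens(text2)
    if len(tokens1) != len(tokens2):
        return False
    # Compare first with first, second with second, and so on.
    return all(a == b for a, b in zip(tokens1, tokens2))
```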

# Token Overlap

The token overlap strategy matches if there is a certain overlap between the tokens of both texts. It compares all tokens from text 1 with all tokens from text 2 and determines the combination that leaves the smallest number of additional (unmatched) tokens. It is satisfied if the number of additional tokens is equal to or lower than the configured maximum additional tokens.

This strategy works in two different modes. Either one of the two texts must be fully included in the other text (one of the texts must have no additional tokens) or both texts may have additional tokens, but share at least one token.

Optionally you can remove identical tokens before processing by enabling the corresponding checkbox.

Due to the complexity of this strategy, it only works for a maximum of 8 tokens (default: 6).

# Examples

  • Mode: One input must be completely included in the other one
  • Maximum Additional Tokens: 2
  • Maximum Processable Tokens: 5
  • Remove Duplicate Tokens: disabled

| Text 1 | Text 2 | Additional Tokens | Matches |
| --- | --- | --- | --- |
| John | John | 0 | true |
| John Smith | John Smith | 0 | true |
| John Smith | Smith John | 0 | true |
| John Smith | John Doe | 2 | false |
| John Smith | John Smith-Doe | 1 | true |
| John Smith | Smith John Doe | 1 | true |
| John John | John Smith | 2 | false |
| John Smith | John Jim Smith-Doe | 2 | true |
| John Smith | John Jim Jane Smith-Doe | 3 | false |
| John Jim Jane Julia Smith-Doe | John Jim Jane Julia Smith-Doe | n/a* | false |

  • Mode: One input must be completely included in the other one
  • Maximum Additional Tokens: 2
  • Maximum Processable Tokens: 5
  • Remove Duplicate Tokens: enabled

| Text 1 | Text 2 | Additional Tokens | Matches |
| --- | --- | --- | --- |
| John John | John Smith | 1 | true |

  • Mode: At least one token between the inputs must be similar
  • Maximum Additional Tokens: 2
  • Maximum Processable Tokens: 5
  • Remove Duplicate Tokens: disabled

| Text 1 | Text 2 | Additional Tokens | Matches |
| --- | --- | --- | --- |
| John | John | 0 | true |
| John Smith | John Smith | 0 | true |
| John Smith | Smith John | 0 | true |
| John Smith | John Doe | 2 | true |
| John Smith | John Smith-Doe | 1 | true |
| John Smith | Smith John Doe | 1 | true |
| John John | John Smith | 2 | true |
| John Smith | John Jim Smith-Doe | 2 | true |
| John Smith | John Jim Jane Smith-Doe | 3 | false |
| John Jim Jane Julia Smith-Doe | John Jim Jane Julia Smith-Doe | n/a* | false |

(*) not matched due to too many tokens
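
The sketch below reproduces the example results for the special case of exact token comparison, where the best pairing can be found with a simple multiset intersection. With fuzzy comparisons a real implementation has to search over pairings, which is presumably why the token limit exists. The function and parameter names, the per-text interpretation of the token limit, and removing duplicates before applying that limit are all assumptions.

```python
import re
from collections import Counter


def word_tokens(text: str) -> list[str]:
    # Assumed word-based tokenization: "Smith-Doe" becomes two tokens.
    return re.findall(r"\w+", text)


def token_overlap(text1: str, text2: str, max_additional: int, *,
                  require_inclusion: bool, max_tokens: int = 6,
                  remove_duplicates: bool = False) -> bool:
    """Token overlap for exact token comparison only.

    require_inclusion=True  -> one text must have no additional tokens
    require_inclusion=False -> both may have extras, but one token is shared
    """
    tokens1, tokens2 = word_tokens(text1), word_tokens(text2)
    if remove_duplicates:
        # dict.fromkeys keeps the first occurrence of each token.
        tokens1 = list(dict.fromkeys(tokens1))
        tokens2 = list(dict.fromkeys(tokens2))
    if len(tokens1) > max_tokens or len(tokens2) > max_tokens:
        return False  # too many tokens to process
    # Multiset intersection: how many tokens can be paired one-to-one.
    matched = sum((Counter(tokens1) & Counter(tokens2)).values())
    extra1, extra2 = len(tokens1) - matched, len(tokens2) - matched
    if require_inclusion:
        mode_ok = extra1 == 0 or extra2 == 0
    else:
        mode_ok = matched >= 1
    return mode_ok and extra1 + extra2 <= max_additional
```

For "John John" and "John Smith" only one "John" can be paired, leaving two additional tokens; with duplicates removed, text 1 shrinks to a single "John" that is fully included in text 2.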

# Token Ratio

Token ratio compares all tokens from text 1 with all tokens from text 2 and calculates a ratio of matching tokens. If the ratio is equal to or higher than the configured ratio, then the matcher is satisfied.

The ratio is calculated as $r = \frac{t_m}{t_u}$, where $t_m$ is the number of matching tokens and $t_u$ is the total number of unique tokens.

# Examples

| Text 1 | Text 2 | Ratio |
| --- | --- | --- |
| John | John | 1.0 |
| John Smith | John Smith | 1.0 |
| John Smith | Smith John | 1.0 |
| John Smith | John Doe | $0.\overline{3}$ |
| John Smith | John Smith-Doe | $0.\overline{6}$ |
| John Smith | Smith John Doe | $0.\overline{6}$ |
| John John | John Smith | 0.5 |
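
For exact token comparison, the ratio can be sketched with plain set operations: the matching tokens are those present in both texts, and the unique tokens are those present in either text. This sketch does not cover other text comparisons, and the function names and the value returned for two empty texts are assumptions.

```python
import re


def word_tokens(text: str) -> list[str]:
    # Assumed word-based tokenization: "Smith-Doe" becomes two tokens.
    return re.findall(r"\w+", text)


def token_ratio(text1: str, text2: str) -> float:
    """Ratio of matching tokens to unique tokens (exact comparison)."""
    unique1, unique2 = set(word_tokens(text1)), set(word_tokens(text2))
    all_unique = unique1 | unique2   # t_u: unique tokens across both texts
    if not all_unique:
        return 0.0                   # assumption: no tokens means no match
    matching = unique1 & unique2     # t_m: tokens found in both texts
    return len(matching) / len(all_unique)
```

A matcher built on this would be satisfied when `token_ratio(text1, text2)` is equal to or higher than the configured ratio.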

# Example with Other Text Comparison

Be aware that the required uniqueness of tokens might change the results when working with non-equal text comparisons.

  • Compare Tokens Using: Phonetic (Soundex)

| Text 1 | Text 2 | Ratio |
| --- | --- | --- |
| John Smith | John Smith | 1.0 |
| John Smith | Johnn Smith | $0.\overline{6}$ |

# Empty Aware Matcher

The empty aware matcher wraps an existing matcher and is satisfied if the wrapped matcher is satisfied or if one or both values are empty.

# Examples

  • Using Simple Text Equality as a wrapped matcher
Text 1 Text 2 Wrapped Matcher Result Matches
John John John Smith true true
John John John Doe false false
John John false true
false true
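
The wrapping behaviour can be sketched as a higher-order function. The names below are illustrative, and treating only `None` and the empty string as empty is an assumption (whitespace-only values, for instance, are not handled).

```python
from typing import Callable, Optional

Matcher = Callable[[Optional[str], Optional[str]], bool]


def simple_text_equality(a: Optional[str], b: Optional[str]) -> bool:
    # Stand-in for the wrapped matcher: exact text equality.
    return a == b


def empty_aware(wrapped: Matcher) -> Matcher:
    """Wrap a matcher so that it is also satisfied when a value is empty."""
    def matcher(a: Optional[str], b: Optional[str]) -> bool:
        if not a or not b:       # None or "" counts as empty (assumption)
            return True
        return wrapped(a, b)     # otherwise defer to the wrapped matcher
    return matcher


match = empty_aware(simple_text_equality)
```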

# Geographical Distance

The geographical distance compares geographical coordinates from two records with each other and is satisfied if they are within a given distance. The coordinates are specified by providing a latitude and a longitude using signed decimal degrees without compass direction, e.g. latitude 53.158953 and longitude 12.793203 instead of 53° 09′ 32″ N, 012° 47′ 36″ E.

The distance must be provided in km. Use e.g. 0.1 if you want to provide a distance of 100m.

If either the latitude or longitude value is missing or empty, the matcher will not be satisfied. The valid value 0.0 is not considered empty, but an empty string is.

The optional initial distance can be ignored in most cases. When indexing records, Tilores optimizes how the data is stored for faster searching. This optimization is based, among other things, on the provided distance. Changing the distance after records have been indexed can therefore lead to fewer matches than expected, or even none. If you still want to change the distance afterwards, provide the original distance as the initial distance and set the distance to the new value. Be aware, however, that this can reduce performance when the distance is higher than the initial distance.

# Examples

  • Distance: 0.5

| Latitude 1 | Longitude 1 | Latitude 2 | Longitude 2 | Distance | Matches |
| --- | --- | --- | --- | --- | --- |
| 52.516340 | 13.377709 | 52.518753 | 13.376249 | ~0.27km | true |
| 52.516340 | 13.377709 | 52.514551 | 13.350095 | ~1.89km | false |
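
A minimal sketch of such a check, assuming the great-circle (haversine) formula with a mean Earth radius and an inclusive distance limit. The formula the product actually uses is not stated here, so computed distances may differ slightly from the table, and all names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumption)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_match(lat1, lon1, lat2, lon2, max_km: float) -> bool:
    """Satisfied if both points are complete and within max_km of each other."""
    coords = (lat1, lon1, lat2, lon2)
    # None and "" are empty; the valid value 0.0 is not.
    if any(value is None or value == "" for value in coords):
        return False
    return haversine_km(*(float(value) for value in coords)) <= max_km
```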

# Probabilistic Matching

The probabilistic matching uses statistical methods to calculate the similarity between records. The result of the probabilistic matching is a similarity score. In order for two records to match, this similarity score must exceed a certain threshold.

To use a probabilistic matcher, it must first be trained using unlabeled example data. Training and score calculation only use the selected transformation outputs (features). This allows you to exclude fields that are not related to the matching process or that should be used within another matcher.

Combining probabilistic matching with deterministic matching (e.g. simple text equality) enables very fine-grained control over the matching process. As an example, you may want to use probabilistic matching on the name and date of birth fields, but use exact matching on a city field.

To configure the probabilistic matching, you must define the relevant features. Each selected feature must be assigned at least one comparer. Comparers should be ordered from most specific to the most fuzzy one. If you are unsure which comparers to use, it is acceptable to choose a single exact comparer. Often this will yield good initial results already.

(Figure: example configuration of probabilistic matching)

Probabilistic matching requires a good blocking strategy. Blocking prevents long model training times by clustering the data into blocks. Instead of comparing each record with all other records, each record is only compared with the other records from the same block. By default, Tilores automatically creates blocks that are small enough for good training and model performance. Depending on the features, it might be necessary to take control of the blocking.
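
The general idea of blocking can be illustrated with a short, generic sketch. It is not Tilores' internal algorithm; the function name, the record layout and the feature names are hypothetical. Records that share the same values for the blocking features land in the same block, and candidate pairs are generated only within a block.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: dict, blocking_features: list) -> set:
    """Return the record ID pairs that share a block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        # The block key is the combination of the selected feature values.
        key = tuple(record.get(feature) for feature in blocking_features)
        blocks[key].append(record_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"last_name": "Smith", "city": "Berlin"},
    2: {"last_name": "Smith", "city": "Berlin"},
    3: {"last_name": "Doe", "city": "Berlin"},
}
by_name = candidate_pairs(records, ["last_name"])  # one pair: (1, 2)
by_city = candidate_pairs(records, ["city"])       # all three pairs
```

Blocking on the low cardinality `city` value puts every record into one block, so all pairs are compared, whereas blocking on `last_name` leaves a single candidate pair.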

# Automatic Blocking with All Features

This is the default strategy. It works by clustering the data of n features into the same block. This works well for cases where all features have a high cardinality/uniqueness (e.g. names, addresses, etc.). If low cardinality features (e.g. gender, title, etc.) are available, then this may cause a few large blocks.

# Automatic Blocking with Selected Features

This blocking strategy works the same way as the previous one, except that it is possible to explicitly exclude low cardinality features for blocking. It will still use these features for training, but without the unintended effects for selecting potentially matching record pairs.

# Manual Blocking

This blocking strategy allows you to select the features that you want to combine for blocking. While automatic blocking always creates blocks with n features, you can choose any possible feature combination with manual blocking.

(Figure: example configuration of manual blocking)

# Simple Text Equality

The simple text equality matcher is satisfied if the provided transformation output has the exact same text value for two records.

# Examples

| Text 1 | Text 2 | Matches |
| --- | --- | --- |
| John Smith | John Smith | true |
| John Smith | Smith John | false |

# Temporal Distance

The temporal distance matches if two timestamps are within the given time frame.

The temporal distance option is a text of decimal numbers, each with a unit suffix such as "24h" or "2h30m15s". Valid units are "h" (hour), "m" (minute) and "s" (second).

The value for the transformation output must be a valid timestamp in the RFC3339Nano format. Other time formats might be supported, but there is no guarantee.

# Examples

  • Temporal Distance: 24h

| Time 1 | Time 2 | Matches |
| --- | --- | --- |
| 2023-06-07T11:18:32Z | 2023-06-07T12:00:00Z | true |
| 2023-06-07T11:18:32Z | 2023-06-08T11:18:33Z | false |
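
The check can be sketched as follows. The helper names are illustrative, the limit is assumed to be inclusive, and the timestamp parsing relies on Python's `datetime.fromisoformat`, which does not preserve nanosecond precision; the examples here therefore use whole seconds.

```python
import re
from datetime import datetime, timedelta

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_PART = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_distance(text: str) -> timedelta:
    """Parse a distance such as "24h" or "2h30m15s" into a timedelta."""
    if not re.fullmatch(r"(?:\d+(?:\.\d+)?[hms])+", text):
        raise ValueError(f"invalid temporal distance: {text!r}")
    seconds = sum(float(number) * _UNIT_SECONDS[unit]
                  for number, unit in _PART.findall(text))
    return timedelta(seconds=seconds)


def parse_timestamp(text: str) -> datetime:
    # fromisoformat() before Python 3.11 rejects a trailing "Z".
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def temporal_match(time1: str, time2: str, distance: str) -> bool:
    """Satisfied if the two timestamps are at most `distance` apart."""
    delta = abs(parse_timestamp(time1) - parse_timestamp(time2))
    return delta <= parse_distance(distance)
```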