# Matcher Reference

The following matchers are available for your rules configuration.

# complete

The complete matcher compares different fields with each other and is satisfied if the compared fields are equal.

Example configuration:

{
  "id":"<uuid>",
  "type":"complete",
  "attributes":{
    "comparison":[
      {
        "a":"$.someField",
        "b":"$.someField"
      },
      {
        "a":"$.someOtherField",
        "b":"$.yetAnotherField"
      }
    ],
    "modifier":"onlyNumbers"
  }
}

This matcher will be satisfied if someField from A is equal to someField from B and someOtherField from A is equal to yetAnotherField from B.

You can provide as many a/b combinations in the comparison configuration as you like. Keep in mind that all of them must match to satisfy the matcher (AND).

Optionally you can provide a modifier which will alter the values before the comparison.

If one of the fields is missing or has an empty value, the matcher will not be satisfied.
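The AND semantics and the handling of empty values described above can be sketched as follows. This is a minimal illustration, not the TiloRes implementation: records are plain dictionaries, the paths are flat keys instead of JSONPath expressions, no modifier is applied, and the helper names `is_empty` and `complete_match` are hypothetical.

```python
def is_empty(value):
    """Treat None and the empty string as empty (an assumption of this sketch)."""
    return value is None or value == ""


def complete_match(record_a, record_b, comparison):
    """Return True only if every a/b pair holds equal, non-empty values."""
    for pair in comparison:
        value_a = record_a.get(pair["a"])
        value_b = record_b.get(pair["b"])
        if is_empty(value_a) or is_empty(value_b):
            return False  # a missing or empty field never satisfies the matcher
        if value_a != value_b:
            return False  # every pair must match (AND)
    return True


comparison = [
    {"a": "someField", "b": "someField"},
    {"a": "someOtherField", "b": "yetAnotherField"},
]
a = {"someField": "x", "someOtherField": "42"}
b = {"someField": "x", "yetAnotherField": "42"}
print(complete_match(a, b, comparison))  # True
```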

# exact

The exact matcher compares one field on A and B with each other and is satisfied if both values are equal and not empty.

Example configuration:

{
  "id":"<uuid>",
  "type":"exact",
  "attributes":{
    "field":"$.someField",
    "modifier":"onlyNumbers"
  }
}

Optionally you can provide a modifier which will alter the values before the comparison.

If one of the fields is missing or has an empty value, the matcher will not be satisfied.

# exactXorEmpty

The exactXorEmpty matcher compares one field on A and B with each other and is satisfied if either both values are equal or both values are empty. It will not match if only one of the values is empty.

Example configuration:

{
  "id":"<uuid>",
  "type":"exactXorEmpty",
  "attributes":{
    "field":"$.someField",
    "modifier":"onlyNumbers"
  }
}

Optionally you can provide a modifier which will alter the values before the comparison. In that case the matcher will be satisfied if the result from the modifier is either equal or is empty for both values.

If one of the fields is missing it will be considered as empty.
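The difference between exact and exactXorEmpty is easiest to see side by side. The sketch below compares two already extracted (and, if configured, already modified) values. The function names are hypothetical, and treating `None` and the empty string as empty is an assumption.

```python
def is_empty(value):
    """Treat None (missing) and the empty string as empty (an assumption)."""
    return value is None or value == ""


def exact(value_a, value_b):
    """Satisfied only if both values are equal and not empty."""
    return not is_empty(value_a) and value_a == value_b


def exact_xor_empty(value_a, value_b):
    """Satisfied if both values are equal or both values are empty."""
    if is_empty(value_a) and is_empty(value_b):
        return True
    if is_empty(value_a) or is_empty(value_b):
        return False  # only one side is empty
    return value_a == value_b


for pair in [("x", "x"), ("x", "y"), ("", None), ("x", "")]:
    print(pair, exact(*pair), exact_xor_empty(*pair))
```

Only the third pair, where both values are empty, is treated differently by the two functions.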

# fieldEquality

The fieldEquality matcher compares multiple fields on A and B with each other and is satisfied if all of these fields contain the same non-empty value.

Example configuration:

{
  "id": "<uuid>",
  "type": "fieldEquality",
  "attributes": {
    "fields": ["$.someField", "$.someOtherField"],
    "phonetic": "equal",
    "modifier":"onlyNumbers"
  }
}

The matcher will be satisfied when someField from A equals someField from B and someOtherField from A equals someOtherField from B after the optional phonetic has been applied.

Optionally you can provide a modifier which will alter the values before the comparison. The modifier will be applied to all fields individually.

You can provide as many fields as you would like.

If one of the fields is missing or has an empty value, the matcher will not be satisfied.

# geoDistance

The geoDistance matcher compares geographical coordinates from A and B with each other and is satisfied if they are within a given distance. The coordinates are specified by providing a latitude and a longitude using signed decimal degrees without compass direction, e.g. latitude 53.158953 and longitude 12.793203 instead of 53° 09′ 32″ N, 012° 47′ 36″ E.

Example configuration:

{
  "id": "<uuid>",
  "type": "geoDistance",
  "attributes": {
    "latField": "$.coords.lat",
    "lngField": "$.coords.lng",
    "distance": 5.0,
    "initialDistance": 10.0
  }
}

The latField and the lngField are the paths used to read the latitude and longitude from both A and B. If the distance between the two coordinates is less than or equal to the distance (in km), the matcher will be satisfied.

If the value for the latField or the lngField is missing or empty, the matcher will not be satisfied. The valid value 0.0 is not considered empty, but an empty string is.
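The distance check can be illustrated with the haversine formula. Which formula TiloRes actually uses is not stated here, so the haversine formula, the mean Earth radius, the flat `lat`/`lng` keys, and the function names below are all assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat_a, lng_a, lat_b, lng_b):
    """Great-circle distance in km between two points in signed decimal degrees."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lng_b - lng_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geo_distance_match(coords_a, coords_b, distance_km):
    """Return True if both coordinates are present and within distance_km."""
    values = [coords_a.get("lat"), coords_a.get("lng"),
              coords_b.get("lat"), coords_b.get("lng")]
    # 0.0 is a valid value; only missing values and empty strings count as empty
    if any(v is None or v == "" for v in values):
        return False
    return haversine_km(*map(float, values)) <= distance_km


print(geo_distance_match({"lat": 0.0, "lng": 0.0}, {"lat": 0.0, "lng": 0.01}, 5.0))
```

The two points in the usage line are roughly 1.1 km apart, so they fall within the 5.0 km from the example configuration.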

The optional initialDistance can be ignored in most cases. When indexing records using a geoDistance matcher, TiloRes optimizes how the data is stored for faster searching. This optimization is based, among other things, on the provided distance. Changing the distance after records have been indexed can therefore lead to fewer matches than expected, or even none. If you still want to change the distance afterwards, provide the original distance in initialDistance and set distance to the new value. Be aware, however, that this can reduce performance when the distance is higher than the initialDistance.

# similarity

The similarity matcher compares two strings using well-known standard text similarity algorithms in combination with a threshold.

Example configuration:

{
  "id": "<uuid>",
  "type": "similarity",
  "attributes": {
    "field": "$.someField",
    "shingleSize": 4,
    "exceedingThresholds": 1,
    "algorithms": [
      {
        "type": "QGram",
        "threshold": 0.7
      },
      {
        "type": "SorensenDice",
        "threshold": 0.8
      }
    ],
    "modifier":"onlyNumbers"
  }
}

The similarity comparison is performed on the field someField.

The shingleSize is relevant for pre-filtering potential candidates. It defines the k-grams used for indexing the data. Depending on your text size we recommend a value of 4 or 5.

The exceedingThresholds attribute defines how many of the provided algorithms must return a value that reaches their individual threshold. If too few algorithms reach their thresholds, the matcher will not be satisfied. The value for exceedingThresholds must be at least 1 and must not exceed the number of defined algorithms. In the example, the matcher would be satisfied if either the q-gram algorithm returns a value equal to or higher than 0.7 or the Sørensen-Dice algorithm returns a value equal to or higher than 0.8.

Each defined algorithm has its own individual threshold. When comparing values from A and B, each algorithm returns a value between 0.0 and 1.0, where 1.0 means that the texts are exactly the same and 0.0 means that they have nothing in common.
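The interplay between per-algorithm thresholds and exceedingThresholds can be sketched as below. The Sørensen-Dice coefficient is computed over sets of character k-grams, which is one common formulation and not necessarily the variant TiloRes uses. The parameter `k` and all function names are hypothetical, and the scores passed to `similarity_match` are made-up illustration values. Following the example above, reaching a threshold exactly counts as sufficient.

```python
def shingles(text, k):
    """Return the set of character k-grams of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def sorensen_dice(text_a, text_b, k=2):
    """Sørensen-Dice coefficient over character k-gram sets, in [0.0, 1.0]."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        # both texts are shorter than k: fall back to plain equality
        return 1.0 if text_a == text_b else 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def similarity_match(scores, thresholds, exceeding_thresholds):
    """Count the algorithms that reach their threshold and compare the count."""
    reached = sum(1 for name, threshold in thresholds.items()
                  if scores[name] >= threshold)
    return reached >= exceeding_thresholds


thresholds = {"QGram": 0.7, "SorensenDice": 0.8}
scores = {"QGram": 0.75, "SorensenDice": 0.6}  # illustrative values only
print(similarity_match(scores, thresholds, 1))  # True: one threshold is reached
print(similarity_match(scores, thresholds, 2))  # False: both would be required
print(sorensen_dice("night", "nacht"))          # 0.25: one shared bigram out of 4 + 4
```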

The following algorithms are currently available:

Optionally you can provide a modifier which will alter the values before the comparison. The modifier is applied before creating the shingles.

# token

The token matcher compares tokens of a field in A with tokens from another field in B. A token is defined as a run of consecutive Unicode letters, matched by the following regular expression:

[\p{L}]+

Example configuration:

{
  "id": "<uuid>",
  "type": "token",
  "attributes": {
    "comparison": [
      {
        "a":"$.someField",
        "b":"$.someField"
      },
      {
        "a":"$.someOtherField",
        "b":"$.yetAnotherField"
      }
    ],
    "ratio": "0.5",
    "phonetic": "equal",
    "modifier":"substr3"
  }
}

The three attributes comparison, ratio and phonetic are all required.

You can provide as many a/b combinations in the comparison configuration as you like. Keep in mind that all of them must match to satisfy the matcher (AND).

ratio defines the proportion of tokens that must match with each other. The actual ratio is calculated as $r = \frac{t_m}{t_u}$, where $t_m$ is the number of matching tokens and $t_u$ is the total number of unique tokens.

Assuming equal phonetic, the texts Jim John and John have $r = \frac{t_m}{t_u} = \frac{1}{2} = 0.5$, because there are two unique tokens (Jim and John) across both texts and one matching token (John).

In contrast, the texts Jim John and Johnn Jim have $r = \frac{t_m}{t_u} = \frac{1}{3} = 0.\overline{3}$, because there are three unique tokens (Jim, Johnn and John) and only one matching token (Jim).

Assuming a phonetic under which John and Johnn also match, the previous texts Jim John and Johnn Jim have $r = \frac{t_m}{t_u} = \frac{2}{3} = 0.\overline{6}$.
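The ratio for the equal phonetic case can be reproduced with a few lines of Python. This is a sketch under assumptions: Python's built-in `re` module does not support `\p{L}`, so the pattern `[^\W\d_]+` is used as an approximation of the regular expression above, tokens are compared case-sensitively, no modifier or phonetic is applied, and the function names are hypothetical.

```python
import re

# Approximation of [\p{L}]+ for Python's re module: word characters
# without digits and underscore.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokens(text):
    """Return the set of unique letter-only tokens in text."""
    return set(TOKEN_PATTERN.findall(text))


def token_ratio(text_a, text_b):
    """Matching tokens divided by the unique tokens across both texts."""
    tokens_a, tokens_b = tokens(text_a), tokens(text_b)
    unique = tokens_a | tokens_b
    if not unique:
        return 0.0  # no tokens at all; treated as no match in this sketch
    return len(tokens_a & tokens_b) / len(unique)


print(token_ratio("Jim John", "John"))       # 0.5
print(token_ratio("Jim John", "Johnn Jim"))  # 1/3, roughly 0.33
```

With the ratio of 0.5 from the example configuration, the first pair would reach the required ratio and the second would not.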

Optionally, you can provide a modifier which will alter the values before the comparison. The modifier will be applied to each resulting token individually.

# tokenOverlap

Similar to the token matcher, the tokenOverlap matcher compares tokens of a field in A with tokens from another field in B. A token is defined as a run of consecutive Unicode letters, matched by the following regular expression:

[\p{L}]+

Example configuration:

{
  "id": "<uuid>",
  "type": "tokenOverlap",
  "attributes": {
    "comparison": [
      {
        "a":"$.someField",
        "b":"$.someField"
      },
      {
        "a":"$.someOtherField",
        "b":"$.yetAnotherField"
      }
    ],
    "maxAdditionalTokens": "2",
    "maxProcessableTokens": "6",
    "phonetic": "equal",
    "mode": "TOTAL",
    "modifier":"companyName"
  }
}

The attributes comparison, maxAdditionalTokens and phonetic are required. The optional attributes are maxProcessableTokens (defaults to 4) and mode (defaults to EXPECT_ONE_EMPTY_LIST).

You can provide as many a/b combinations in the comparison configuration as you like. Keep in mind that all of them must match to satisfy the matcher (AND).

Unlike the token matcher with its ratio, this matcher uses maxAdditionalTokens, which is the number of tokens allowed in addition to the tokens that overlap between (exist in) both compared fields.

Assuming equal phonetic, the texts Jim John Jackson and Jim have 1 common token (Jim) and 2 additional tokens (John and Jackson), so they are considered matching if the provided maxAdditionalTokens is at least 2. The order of the tokens does not matter; comparing John Jim Jackson with Jim leads to the same result.

The default mode of this matcher considers two texts matching if one text includes all tokens of the other compared text, as presented in the previous example, whereas Jim John Jackson and Jim Michael Jackson would not match.

The other mode is TOTAL, which considers two texts matching if at least one common token exists, as long as the number of additional tokens (from both text fields) does not exceed maxAdditionalTokens. Jim John Jackson and Jim J. Michael Jackson have three additional tokens (John, J and Michael), so they are considered matching if maxAdditionalTokens is set to 3, and NOT matching if set to 2.

maxProcessableTokens is an optional attribute (defaults to 4) that limits the number of tokens allowed in each of the compared fields. In other words, if the number of tokens in the larger compared field exceeds maxProcessableTokens, the compared fields are considered NOT matching, even if they are exactly the same. For example, comparing Jim J. John Michael Jackson with the exact same text is considered matching as long as maxProcessableTokens is set to at least 5.
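The two modes and the token limit can be combined into one small sketch. It rests on several assumptions: `[^\W\d_]+` approximates the regular expression above because Python's `re` module lacks `\p{L}`, tokens are handled as case-sensitive sets (duplicates are ignored), texts without any common token never match, and no modifier or phonetic is applied. The function names are hypothetical.

```python
import re

TOKEN_PATTERN = re.compile(r"[^\W\d_]+")  # approximation of [\p{L}]+


def tokens(text):
    """Return the set of unique letter-only tokens in text."""
    return set(TOKEN_PATTERN.findall(text))


def token_overlap_match(text_a, text_b, max_additional_tokens,
                        max_processable_tokens=4,
                        mode="EXPECT_ONE_EMPTY_LIST"):
    tokens_a, tokens_b = tokens(text_a), tokens(text_b)
    # too many tokens on either side: never a match, even for identical texts
    if max(len(tokens_a), len(tokens_b)) > max_processable_tokens:
        return False
    common = tokens_a & tokens_b
    if not common:
        return False  # assumption: at least one common token is required
    only_a, only_b = tokens_a - common, tokens_b - common
    # default mode: one text must contain all tokens of the other,
    # so at most one side may have additional tokens
    if mode == "EXPECT_ONE_EMPTY_LIST" and only_a and only_b:
        return False
    return len(only_a) + len(only_b) <= max_additional_tokens


print(token_overlap_match("Jim John Jackson", "Jim", 2))                  # True
print(token_overlap_match("Jim John Jackson", "Jim Michael Jackson", 2))  # False
print(token_overlap_match("Jim John Jackson", "Jim J. Michael Jackson",
                          2, mode="TOTAL"))                               # False
```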

Optionally, you can provide a modifier which will alter the values before the comparison. The modifier will be applied to each resulting token individually.