# Matcher Reference
Outdated
Rule configuration documentation is not yet up to date. We have made major changes and still have to update the documentation. Please refer to our release notes for more details.
The following matchers are available for your rules configuration.

In the following, `A` always represents the record that is already stored/indexed in Tilores, and `B` represents either the record that will be submitted or the search parameters.
## complete

The `complete` matcher compares different fields with each other and is satisfied if the compared fields are equal.
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "complete",
  "attributes": {
    "comparison": [
      {
        "a": "$.someField",
        "b": "$.someField"
      },
      {
        "a": "$.someOtherField",
        "b": "$.yetAnotherField"
      }
    ],
    "modifier": "onlyNumbers"
  }
}
```
This matcher will be satisfied if `someField` from `A` is equal to `someField` from `B` and `someOtherField` from `A` is equal to `yetAnotherField` from `B`.

You can provide as many `a`/`b` combinations in the `comparison` configuration as you would like. Keep in mind that all of them must match in order to satisfy the matcher (AND).
Optionally you can provide a modifier which will alter the values before the comparison.
If one of the fields is missing or has an empty value, the matcher will not be satisfied.
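
The AND semantics and the empty-value rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the Tilores implementation: `resolve`, `is_empty` and `complete_matches` are hypothetical helpers, the path handling only covers simple `$.a.b` paths, and the modifier is passed in as a plain function.

```python
from typing import Any, Callable, Dict, List, Optional


def resolve(record: dict, path: str) -> Any:
    """Look up a simple '$.a.b' path in nested dicts; None if a part is missing."""
    keys = path[2:].split(".") if path.startswith("$.") else path.split(".")
    value: Any = record
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def is_empty(value: Any) -> bool:
    """Missing values and empty strings both count as empty."""
    return value is None or value == ""


def complete_matches(
    a: dict,
    b: dict,
    comparison: List[Dict[str, str]],
    modifier: Optional[Callable[[Any], Any]] = None,
) -> bool:
    """Every a/b pair must be non-empty and equal (AND)."""
    for pair in comparison:
        left = resolve(a, pair["a"])
        right = resolve(b, pair["b"])
        if modifier is not None:
            # The modifier (assumed to be a plain function) runs before comparing.
            left = left if is_empty(left) else modifier(left)
            right = right if is_empty(right) else modifier(right)
        if is_empty(left) or is_empty(right) or left != right:
            return False
    return True


comparison = [
    {"a": "$.someField", "b": "$.someField"},
    {"a": "$.someOtherField", "b": "$.yetAnotherField"},
]
stored = {"someField": "42", "someOtherField": "x"}
submitted = {"someField": "42", "yetAnotherField": "x"}
print(complete_matches(stored, submitted, comparison))  # True
```

Removing `yetAnotherField` from `submitted` makes the second pair fail, so the whole matcher is no longer satisfied.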
## exact

The `exact` matcher compares one field on `A` and `B` with each other and is satisfied if both values are equal and not empty.
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "exact",
  "attributes": {
    "field": "$.someField",
    "modifier": "onlyNumbers"
  }
}
```
Optionally you can provide a modifier which will alter the values before the comparison.
If one of the fields is missing or has an empty value, the matcher will not be satisfied.
## exactXorEmpty

The `exactXorEmpty` matcher compares one field on `A` and `B` with each other and is satisfied if either both values are equal or both values are empty. It will not match if only one of the values is empty.
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "exactXorEmpty",
  "attributes": {
    "field": "$.someField",
    "modifier": "onlyNumbers"
  }
}
```
Optionally you can provide a modifier which will alter the values before the comparison. In that case the matcher will be satisfied if the result from the modifier is either equal or is empty for both values.
If one of the fields is missing it will be considered as empty.
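
The difference between `exact` and `exactXorEmpty` comes down to how empty values are treated. The following Python sketch contrasts the two. It is a simplified illustration rather than the actual implementation: the function names are hypothetical, and `only_numbers` is an assumed stand-in for the `onlyNumbers` modifier that simply keeps the digits of a value.

```python
from typing import Any, Callable, Optional

Modifier = Optional[Callable[[str], str]]


def is_empty(value: Any) -> bool:
    """Missing values and empty strings both count as empty."""
    return value is None or value == ""


def apply_modifier(value: Any, modifier: Modifier) -> Any:
    """Run the modifier on present values only; missing values stay missing."""
    if modifier is None or value is None:
        return value
    return modifier(str(value))


def only_numbers(value: str) -> str:
    """Assumed behaviour of a digits-only modifier (hypothetical)."""
    return "".join(ch for ch in value if ch.isdigit())


def exact_matches(a_value: Any, b_value: Any, modifier: Modifier = None) -> bool:
    """Satisfied only if both values are non-empty and equal."""
    a_value, b_value = apply_modifier(a_value, modifier), apply_modifier(b_value, modifier)
    if is_empty(a_value) or is_empty(b_value):
        return False
    return a_value == b_value


def exact_xor_empty_matches(a_value: Any, b_value: Any, modifier: Modifier = None) -> bool:
    """Satisfied if both values are equal or both are empty, but not if only one is empty."""
    a_value, b_value = apply_modifier(a_value, modifier), apply_modifier(b_value, modifier)
    if is_empty(a_value) and is_empty(b_value):
        return True
    if is_empty(a_value) or is_empty(b_value):
        return False
    return a_value == b_value


print(exact_matches("", ""))                 # False: empty values never satisfy exact
print(exact_xor_empty_matches(None, ""))     # True: a missing field counts as empty
print(exact_xor_empty_matches("a1", "b", only_numbers))  # False: "1" vs. empty
```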
## fieldEquality

The `fieldEquality` matcher compares multiple fields on `A` and `B` with each other and is satisfied if all of these fields contain the same non-empty value.
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "fieldEquality",
  "attributes": {
    "fields": ["$.someField", "$.someOtherField"],
    "phonetic": "equal",
    "modifier": "onlyNumbers"
  }
}
```
The matcher will be satisfied when `someField` from `A` equals `someField` from `B` and `someOtherField` from `A` equals `someOtherField` from `B` after the optional phonetic has been applied.
Optionally you can provide a modifier which will alter the values before the comparison. The modifier will be applied to all fields individually.
You can provide as many fields as you would like.
If one of the fields is missing or has an empty value, the matcher will not be satisfied.
## geoDistance

The `geoDistance` matcher compares geographical coordinates from `A` and `B` with each other and is satisfied if they are within a given distance. The coordinates are specified by providing a latitude and a longitude using signed decimal degrees without compass direction, e.g. latitude `53.158953` and longitude `12.793203` instead of `53° 09′ 32″ N, 012° 47′ 36″ E`.
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "geoDistance",
  "attributes": {
    "latField": "$.coords.lat",
    "lngField": "$.coords.lng",
    "distance": 5.0,
    "initialDistance": 10.0
  }
}
```
The `latField` and the `lngField` are the paths from which the values are read on both `A` and `B`. If the distance between the coordinates is less than or equal to the `distance` (in km), the matcher will be satisfied.

If the value for the `latField` or the `lngField` is missing or empty, the matcher will not be satisfied. The valid value `0.0` is not considered empty, but an empty string is.
The optional `initialDistance` can be ignored in most cases. When indexing records using a `geoDistance` matcher, Tilores optimizes how the data is stored for faster searching. This optimization is, among other things, based on the provided `distance`. Changing the `distance` after records have been indexed can lead to situations in which fewer or even no records are matched compared to the expected results. If you still want to change the distance afterwards, you can provide the original distance in `initialDistance` and change `distance` to whatever you like. However, be aware that this can reduce performance when the `distance` is higher than the `initialDistance`.
When writing a deduplication rule for a rule with a `geoDistance` matcher, you can either replace that matcher with a `fieldEquality` matcher or change the distance to a relatively low value, e.g. ten meters (`0.01`, since `distance` is given in km). We recommend the latter approach. This way coordinates that represent the same thing, e.g. a house, will not be affected by slightly different coordinates from different data sources. `52.516235, 13.377707` and `52.516263, 13.377766` probably refer to the same object despite the coordinates being about five meters apart.
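
To get a feeling for the distances involved, the check can be approximated with the haversine formula. The sketch below is an assumption about how such a comparison could look, not a description of how Tilores computes distances: the function names, the mean Earth radius constant and the choice of the haversine formula are all illustrative.

```python
import math
from typing import Any

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in km between two points given in signed decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_empty(value: Any) -> bool:
    """Missing values and empty strings are empty; the number 0.0 is not."""
    return value is None or value == ""


def geo_distance_matches(a_lat: Any, a_lng: Any, b_lat: Any, b_lng: Any, distance_km: float) -> bool:
    """Satisfied if all coordinates are present and at most distance_km apart."""
    coords = (a_lat, a_lng, b_lat, b_lng)
    if any(is_empty(value) for value in coords):
        return False
    return haversine_km(*(float(value) for value in coords)) <= distance_km


# The two coordinates from the deduplication example are roughly five meters apart.
meters = haversine_km(52.516235, 13.377707, 52.516263, 13.377766) * 1000
print(round(meters, 1))
print(geo_distance_matches(52.516235, 13.377707, 52.516263, 13.377766, 0.01))  # True: within ten meters
```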
## similarity

The `similarity` matcher compares two strings using well-known standard text similarity algorithms in combination with a threshold.
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "similarity",
  "attributes": {
    "field": "$.someField",
    "shingleSize": 4,
    "exceedingThresholds": 1,
    "algorithms": [
      {
        "type": "QGram",
        "threshold": 0.7
      },
      {
        "type": "SorensenDice",
        "threshold": 0.8
      }
    ],
    "modifier": "onlyNumbers"
  }
}
```
The similarity comparison will be performed on the field `someField`.

The `shingleSize` is relevant for pre-filtering potential candidates. It defines the k-grams used for indexing the data. Depending on your text size we recommend a value of 4 or 5.
`exceedingThresholds` defines how many of the provided `algorithms` must return a value that reaches the threshold defined for that algorithm. If not enough thresholds have been reached, the matcher will not be satisfied. The value for `exceedingThresholds` must be at least 1 and must not be greater than the number of defined `algorithms`. In the example, the matcher would be satisfied if either the q-gram algorithm returns a value equal to or higher than `0.7` or the Sørensen-Dice algorithm returns a value equal to or higher than `0.8`.
Each algorithm that has been defined has its own individual threshold. When comparing values from `A` and `B` with each other, each algorithm returns a value between `0.0` and `1.0`, where `1.0` means that the texts are exactly the same and `0.0` means that there is nothing in common.
The following algorithms are currently available:
- Cosine (Cosine similarity)
- DamerauLevenshteinAT (Damerau-Levenshtein distance with adjacent transpositions)
- DamerauLevenshteinOSA (Damerau-Levenshtein with optimal string alignment distance)
- Hamming (Hamming distance)
- Jaccard (Jaccard index)
- Jaro (Jaro similarity)
- JaroWinkler (Jaro-Winkler similarity)
- LCS (longest common subsequence)
- Levenshtein (Levenshtein distance)
- SorensenDice (Sørensen–Dice coefficient)
- QGram (q-gram)
Optionally you can provide a modifier which will alter the values before the comparison. The modifier is applied before creating the shingles.
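
The interplay of per-algorithm thresholds and `exceedingThresholds` can be sketched as follows. This is a simplified illustration under stated assumptions: the two similarity functions are plain character-bigram versions of the Jaccard index and the Sørensen–Dice coefficient written for this example, so their scores may differ from what the matcher actually computes, and `similarity_matches` is a hypothetical helper.

```python
from typing import Callable, Dict, List, Set


def bigrams(text: str) -> Set[str]:
    """Set of character bigrams, e.g. 'night' -> {'ni', 'ig', 'gh', 'ht'}."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard index on character bigrams (simplified)."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def sorensen_dice(a: str, b: str) -> float:
    """Sørensen–Dice coefficient on character bigrams (simplified)."""
    x, y = bigrams(a), bigrams(b)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 1.0


# Only the two algorithms implemented above are available in this sketch.
ALGORITHMS: Dict[str, Callable[[str, str], float]] = {
    "Jaccard": jaccard,
    "SorensenDice": sorensen_dice,
}


def similarity_matches(a: str, b: str, algorithms: List[dict], exceeding_thresholds: int) -> bool:
    """Satisfied if at least `exceeding_thresholds` algorithms reach their own threshold."""
    if not 1 <= exceeding_thresholds <= len(algorithms):
        raise ValueError("exceedingThresholds must be between 1 and the number of algorithms")
    reached = sum(
        1 for algo in algorithms
        if ALGORITHMS[algo["type"]](a, b) >= algo["threshold"]
    )
    return reached >= exceeding_thresholds


config = [
    {"type": "Jaccard", "threshold": 0.2},
    {"type": "SorensenDice", "threshold": 0.25},
]
# 'night' and 'nacht' share one bigram ('ht'): Jaccard = 1/7, Sørensen–Dice = 2/8.
print(similarity_matches("night", "nacht", config, 1))  # True: one threshold is reached
print(similarity_matches("night", "nacht", config, 2))  # False: Jaccard stays below 0.2
```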
## token

The `token` matcher compares tokens of a field in `A` with tokens from another field in `B`. A token is defined as consecutive Unicode letters, expressed via the following regular expression:

`[\p{L}]+`
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "token",
  "attributes": {
    "comparison": [
      {
        "a": "$.someField",
        "b": "$.someField"
      },
      {
        "a": "$.someOtherField",
        "b": "$.yetAnotherField"
      }
    ],
    "ratio": "0.5",
    "phonetic": "equal",
    "modifier": "substr3"
  }
}
```
All three attributes `comparison`, `ratio` and `phonetic` are required.

You can provide as many `a`/`b` combinations in the `comparison` configuration as you would like. Keep in mind that all of them must match in order to satisfy the matcher (AND).
`ratio` defines the proportion of tokens that must match with each other. The actual ratio is calculated as $r = \frac{t_m}{t_u}$, where $t_m$ is the number of matching tokens and $t_u$ is the total number of unique tokens.

Assuming `equal` phonetic, the texts `Jim John` and `John` have $r = \frac{t_m}{t_u} = \frac{1}{2} = 0.5$, because there are two unique tokens (`Jim` and `John`) across both texts and one matching token (`John`).

Whereas the texts `Jim John` and `Johnn Jim` have $r = \frac{t_m}{t_u} = \frac{1}{3} = 0.\overline{3}$, because there are three unique tokens (`Jim`, `Johnn` and `John`) and only one matching token (`Jim`).

Assuming any phonetic in which `John` and `Johnn` are also matching, the previous texts `Jim John` and `Johnn Jim` have $r = \frac{t_m}{t_u} = \frac{2}{3} = 0.\overline{6}$.
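
The ratio calculation for the `equal` phonetic case can be reproduced with a short Python sketch. It is an illustration under stated assumptions: Python's built-in `re` module does not support `\p{L}`, so the pattern `[^\W\d_]+` is used as an approximation of "consecutive Unicode letters"; phonetic handling and modifiers are left out; and comparing the computed ratio against the configured one with `>=` is assumed. The function names are hypothetical.

```python
import re
from typing import Set

# Approximation of [\p{L}]+ for Python's re module: word characters
# without digits and underscore.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokens(text: str) -> Set[str]:
    """Unique tokens of a text, compared as-is ('equal' phonetic)."""
    return set(TOKEN_PATTERN.findall(text))


def token_ratio(a: str, b: str) -> float:
    """r = t_m / t_u: matching tokens divided by all unique tokens of both texts."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    unique = tokens_a | tokens_b
    if not unique:
        return 0.0
    return len(tokens_a & tokens_b) / len(unique)


def token_matches(a: str, b: str, ratio: float) -> bool:
    """Assumed rule: satisfied if the computed ratio reaches the configured ratio."""
    return token_ratio(a, b) >= ratio


print(token_ratio("Jim John", "John"))         # 0.5
print(token_ratio("Jim John", "Johnn Jim"))    # one of three unique tokens matches
print(token_matches("Jim John", "John", 0.5))  # True
```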
Optionally you can provide a modifier which will alter the values before the comparison. The modifier will be applied for each resulting token individually.
## tokenOverlap
When matching names in a rule that also includes other fields such as date of birth, which allows for looser name matching, we recommend using the `token` matcher instead, because its `ratio` scales better for longer names.
Similar to the `token` matcher, the `tokenOverlap` matcher compares tokens of a field in `A` with tokens from another field in `B`. A token is defined as consecutive Unicode letters, expressed via the following regular expression:

`[\p{L}]+`
Example configuration:
```json
{
  "id": "<uuid>",
  "type": "tokenOverlap",
  "attributes": {
    "comparison": [
      {
        "a": "$.someField",
        "b": "$.someField"
      },
      {
        "a": "$.someOtherField",
        "b": "$.yetAnotherField"
      }
    ],
    "maxAdditionalTokens": "2",
    "maxProcessableTokens": "6",
    "phonetic": "equal",
    "mode": "TOTAL",
    "modifier": "companyName"
  }
}
```
The attributes `comparison`, `maxAdditionalTokens` and `phonetic` are required. The optional attributes are `maxProcessableTokens` (defaults to `4`) and `mode` (defaults to `EXPECT_ONE_EMPTY_LIST`).
You can provide as many `a`/`b` combinations in the `comparison` configuration as you would like. Keep in mind that all of them must match in order to satisfy the matcher (AND).
Unlike `ratio` in the `token` matcher, this matcher defines `maxAdditionalTokens`, which is the number of tokens that are allowed in addition to the tokens that overlap between (exist in) both compared fields.

Assuming `equal` phonetic, the texts `Jim John Jackson` and `Jim` have 1 common token (`Jim`) and 2 additional tokens (`John`, `Jackson`), so in this case they are considered matching if the provided `maxAdditionalTokens` is greater than or equal to `2`. The order of the tokens does not matter: comparing `John Jim Jackson` with `Jim` leads to the same result.
The default `mode` of this matcher considers two texts matching if one text includes all tokens of the other compared text, as presented in the previous example, whereas `Jim John Jackson` and `Jim Michael Jackson` would not match.
The other mode is `TOTAL`, which considers two texts matching if at least one common token exists, as long as the number of additional tokens (from both text fields) does not exceed `maxAdditionalTokens`. So `Jim John Jackson` and `Jim J. Michael Jackson`, which have the three additional tokens `John`, `J` and `Michael`, are considered matching if `maxAdditionalTokens` is set to 3, and NOT matching if set to 2.
`maxProcessableTokens` is an optional attribute (defaults to `4`) that limits the maximum number of tokens present in each of the compared fields. In other words, the number of tokens in the larger compared field must not exceed `maxProcessableTokens`; otherwise the compared fields are considered NOT matching, even if they are exactly the same.

When comparing `Jim J. John Michael Jackson` with the exact same text, the two are considered matching as long as `maxProcessableTokens` is set to `5` or more.
Optionally you can provide a modifier which will alter the values before the comparison. The modifier will be applied for each resulting token individually.
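
The two modes and the `maxProcessableTokens` limit can be summarized in a small Python sketch. It reflects one reading of the rules described in this section and is not the actual implementation. The assumptions are: tokens are compared as-is (`equal` phonetic), duplicates within a field are ignored for the overlap, at least one common token is required in both modes, and the pattern `[^\W\d_]+` approximates `[\p{L}]+` because Python's `re` module does not support `\p{L}`. The function names are hypothetical.

```python
import re
from typing import List

# Approximation of [\p{L}]+ for Python's re module.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text: str) -> List[str]:
    """All tokens of a text in order of appearance."""
    return TOKEN_PATTERN.findall(text)


def token_overlap_matches(
    a: str,
    b: str,
    max_additional_tokens: int,
    max_processable_tokens: int = 4,
    mode: str = "EXPECT_ONE_EMPTY_LIST",
) -> bool:
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    # The larger field must not exceed the processable limit.
    if max(len(tokens_a), len(tokens_b)) > max_processable_tokens:
        return False
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a & set_b:
        return False  # assumed: at least one common token is needed
    only_a, only_b = set_a - set_b, set_b - set_a
    if mode != "TOTAL" and only_a and only_b:
        # Default mode: one text must contain all tokens of the other,
        # so one of the two "additional" lists has to be empty.
        return False
    return len(only_a) + len(only_b) <= max_additional_tokens


print(token_overlap_matches("Jim John Jackson", "Jim", 2))                  # True
print(token_overlap_matches("Jim John Jackson", "Jim Michael Jackson", 2))  # False in the default mode
```

Passing `mode="TOTAL"` lifts the containment requirement, so only the total number of additional tokens is checked.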