#
Rules Configuration
Outdated
The rule configuration documentation is not yet up to date. We have made major changes and still have to update the documentation. Please refer to our release notes for more details.
Rules are the most crucial part of any Tilores installation. They define how records are connected into entities and how you can search through your existing entities.
Ideally, you already have an idea of how you want to match your records. If not, we strongly advise you to familiarize yourself with the concepts of matching and deduplication and to get an idea of what is possible with Tilores. Afterwards, you should spend a good amount of time figuring out the rules that match your needs.
Currently, rules can only be configured by editing a JSON file manually. For future versions we plan to provide a UI so that you do not have to worry about the JSON structure.
#
The Basics
Before we start with the actual configuration, we have to reach a common understanding of the terminology.
#
Matching
We talk about matching when we compare two records with each other to identify whether they belong to the same entity. The result of this comparison is that the records either match or do not match; there is no uncertainty (you may be able to define thresholds, though). Matching happens during the so-called assembly process, which runs after records have been submitted to your Tilores instance and need to be assigned to an entity.
The matching can happen either between two records that have been submitted together or between one record that has been submitted and another record that is already stored in Tilores.
The assembly process typically performs multiple matchings. When a list of records is submitted, each record from that list is compared with all other records from the same list. Afterwards, each record from that list is compared with all records already stored in Tilores. If the second comparison yields at least one match, the submitted records are attached to the matching entity; if matches are found in several entities, those entities are merged.
#
Edges
You can think of Tilores as a huge undirected graph, where each record represents one node in that graph. Each pair of records that match can be represented as an edge in that graph. Each connected component of that graph then represents an entity.
An edge in Tilores is represented in the following form:
<record-id>:<other-record-id>:<label>
for example:
6373d7c6-c6bb-4a6f-8275-1dcc12b2f4a7:e4e30d6e-31d2-4b4f-b472-73e5b52c6250:R1EXACT
There are two kinds of edges - static edges and rule based edges.
Static edges are created between all records that were submitted together. E.g. submitting the records 1, 2 and 3 together will create the following edges:
1:2:STATIC
1:3:STATIC
2:3:STATIC
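The static edges of a submission are simply all unordered pairs of the submitted record IDs. The following Python sketch illustrates this; the function name `static_edges` is hypothetical and not part of Tilores.

```python
from itertools import combinations


def static_edges(record_ids):
    """Return one STATIC edge string per unordered pair of submitted records."""
    # combinations() yields every pair exactly once, in input order.
    return [f"{a}:{b}:STATIC" for a, b in combinations(record_ids, 2)]


print(static_edges(["1", "2", "3"]))
# ['1:2:STATIC', '1:3:STATIC', '2:3:STATIC']
```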
Rule based edges, on the other hand, are created whenever a match between two records is found. It is also possible to have multiple edges between the same records when multiple rules match.
E.g. using the previous records, it would be possible to have matches between 1 and 3 and also between 2 and an already stored record 4.
The resulting edges could be:
1:2:STATIC
1:3:STATIC
2:3:STATIC
1:3:R1EXACT
2:4:R1EXACT
2:4:R2EXACT
For an easier understanding we will use the following visual representation instead of the textual one.
graph LR
  subgraph A
    1---|STATIC|2
    1---|STATIC|3
    2---|R1EXACT|4
    2---|R2EXACT|4
    2---|STATIC|3
    1---|R1EXACT|3
  end
  classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
  classDef cluster fill:#7FCEF6,stroke:#7FCEF6
  classDef edgeLabel background-color:#7FCEF6
  classDef label fill:none
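To make the "connected component equals entity" idea concrete, here is a minimal Python sketch that groups record IDs into entities from a list of edge strings. It is only an illustration of the graph concept, not how Tilores is implemented. The function name `entities_from_edges` is hypothetical, the sketch assumes that neither record IDs nor labels contain a colon, and the edge `5:6:R1EXACT` is added only to show a second entity.

```python
from collections import defaultdict


def entities_from_edges(edges):
    """Group record IDs into entities (connected components of the edge graph)."""
    parent = {}

    def find(node):
        # Walk up to the representative of the component, shortening the path.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for edge in edges:
        # Edge format from the text: <record-id>:<other-record-id>:<label>
        a, b, _label = edge.split(":")
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


edges = [
    "1:2:STATIC", "1:3:STATIC", "2:3:STATIC",
    "1:3:R1EXACT", "2:4:R1EXACT", "2:4:R2EXACT",
    "5:6:R1EXACT",  # extra edge, not from the example above
]
print(entities_from_edges(edges))
# [['1', '2', '3', '4'], ['5', '6']]
```

Multiple edges between the same two records do not change the result, because only connectivity matters for the entity.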
#
Searching
We talk about searching when we compare the input search parameters with all records that have already been submitted to Tilores. It is comparable to matching, except that the search parameters may have a different structure than the records.
Where the result of matching is edges, the result of searching is hits. They serve a similar purpose but are structured slightly differently for easier use:
{
  "2": ["R1EXACT", "R2EXACT"],
  "3": ["R1EXACT"]
}
In this example, the search found the records 2 and 3 and lists the rules that were satisfied.
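Because hits map each record ID to the list of satisfied rule IDs, they are easy to process on the client side. The following Python sketch works on the structure shown above; the helper `records_hit_by` is hypothetical and only illustrates one possible use, such as measuring how often a rule contributes to search results.

```python
import json

# The hits structure from the example above.
hits = json.loads('{"2": ["R1EXACT", "R2EXACT"], "3": ["R1EXACT"]}')


def records_hit_by(hits, rule_id):
    """Return the IDs of all records for which the given rule was satisfied."""
    return sorted(record_id for record_id, rules in hits.items() if rule_id in rules)


print(records_hit_by(hits, "R1EXACT"))  # ['2', '3']
print(records_hit_by(hits, "R2EXACT"))  # ['2']
```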
#
Indexing
Indexing means making parts of a record available for searching and matching. Generally speaking, every rule that you want to use for either of the two must also be added to the index. Otherwise it will not yield any results.
We will most likely remove the explicit indexing configuration in a future release, as there is no reason not to index every matching and search rule.
#
Default Configuration
The default configuration is stored in your Tilores project folder under rule-config.json. Let's have a look at the individual parts of a configuration file.
The rules block is a list of rules. Each rule is built from one or more matchers. For a rule to be satisfied, all of its matchers must be satisfied. Technically speaking, they are AND connected.
The ruleSets block is a list of rule sets. Rule sets bundle multiple rules for a specific use, e.g. for the search. For a rule set to be satisfied, at least one of its rules must be satisfied. Technically speaking, they are OR connected.
The searchRuleSet defines which of the previously defined rule sets should be used for searching. Currently only one rule set can be used for searching. Let us know about your use case if you need more than that.
The mutationRuleSetGroups defines at least one group and the rule sets to use for matching, indexing and optionally for deduplication. Later we will show you how to use the default group, and also under which circumstances you may want to define multiple groups.
The default configuration comes with exactly one rule. That rule is satisfied if the myCustomField values from two records (matching) or from a record and the search parameters (search) are exactly the same. This is the simplest configuration possible.
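The AND and OR semantics described above can be sketched in a few lines of Python. This is an illustration of the logic only, not the Tilores implementation. All names (`same_custom_field`, `satisfied_rules`, `rule_set_satisfied`, `EXAMPLE_RULE`) are hypothetical, and treating a missing myCustomField as "no match" is an assumption.

```python
def same_custom_field(a, b):
    """Hypothetical matcher: both sides carry the same myCustomField value."""
    value = a.get("myCustomField")
    # Assumption: a missing value never matches.
    return value is not None and value == b.get("myCustomField")


def satisfied_rules(rule_ids, rules, a, b):
    """Return the IDs of the rules that are satisfied for the inputs a and b."""
    result = []
    for rule_id in rule_ids:
        matchers = rules[rule_id]
        # AND: every matcher of a rule has to be satisfied.
        if matchers and all(matcher(a, b) for matcher in matchers):
            result.append(rule_id)
    return result


def rule_set_satisfied(rule_ids, rules, a, b):
    # OR: a single satisfied rule is enough for the whole rule set.
    return bool(satisfied_rules(rule_ids, rules, a, b))


rules = {"EXAMPLE_RULE": [same_custom_field]}
print(rule_set_satisfied(["EXAMPLE_RULE"], rules,
                         {"myCustomField": "abc"}, {"myCustomField": "abc"}))  # True
print(rule_set_satisfied(["EXAMPLE_RULE"], rules,
                         {"myCustomField": "abc"}, {"myCustomField": "xyz"}))  # False
```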
#
Example: Customize Configuration
While customizing the rules configuration, you can always try it out by running tilores-cli rules simulate. For more details, refer to rules simulate and rules test.
Let's create a custom rules configuration. For this example we are going to work with the following schema:
input NameInput {
  given: String!
  sur: String!
}
input AddressInput {
  zip: String!
  city: String!
  street: String!
  houseNumber: String!
}
input RecordInput {
  id: ID!
  name: NameInput!
  address: AddressInput!
  dateOfBirth: String # in the format YYYY-MM-DD
  someData: String!
}
input SearchParams {
  name: NameInput!
  address: AddressInput!
  dateOfBirth: String # in the format YYYY-MM-DD
}
If it is not clear yet, the entity we want to create is a person. To keep things simple, we did not specify what exactly someData is; it could be financial data, e-commerce orders, contracts, or whatever else makes sense. It could of course be structured much further, but a string is sufficient here, as it is relevant for neither searching nor matching.
To simplify things, we define the fields relevant for matching and searching in the RecordInput and the SearchParams in the same way. This allows us to reuse the rules, but it is not required.
The first rule we want to create is to match when all the relevant fields are equal. For these comparisons we can use the fieldEquality matcher:
{
  "id": "R1EXACT",
  "name": "Rule 1: exact matching on all relevant fields",
  "matchers": [
    {
      "id": "matcher-1",
      "type": "fieldEquality",
      "attributes": {
        "fields": [
          "$.name.given",
          "$.name.sur",
          "$.address.zip",
          "$.address.city",
          "$.address.street",
          "$.address.houseNumber",
          "$.dateOfBirth"
        ]
      }
    }
  ]
}
Let's break it down. Each rule must have a unique ID, in this case R1EXACT. This ID will be the label of an edge or shown in the hits. You can use this, for example, for quality measurements.
Each rule also should have a name. This is currently only for you to understand what this rule is supposed to do, but we may later show this in a UI.
Then, as explained previously, each rule must have at least one matcher, each with a unique ID and a type. Since we only want to ensure that each field is equal, we can use the fieldEquality matcher here. It accepts a list of fields, and for each field it compares the actual values from both records, or from the record and the search parameters.
A field is referred to using a field expression.
However, we now have one small issue. The fieldEquality matcher requires a value for every single field. Even if both parts that need to be compared are empty, the matcher still will not match. In our case that might happen for the optional field dateOfBirth.
Instead of making this a required field or ignoring it completely, we introduce a second matcher for the same rule, the exactXorEmpty matcher. Our configuration now looks like this:
{
"id": "R1EXACT",
"name": "Rule 1: exact matching on all relevant fields",
"matchers": [
{
"id": "matcher-1",
"type": "fieldEquality",
"attributes": {
"fields": [
"$.name.given",
"$.name.sur",
"$.address.zip",
"$.address.city",
"$.address.street",
"$.address.houseNumber"
]
}
},
{
"id": "matcher-2",
"type": "exactXorEmpty",
"attributes": {
"field": "$.dateOfBirth"
}
}
]
}
The exactXorEmpty matcher will match if either both values are equal or both values are empty, but never if only one of the values is empty.
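The difference between the two matchers can be sketched in Python as follows. This is a simplified illustration of the behavior described in the text, not the actual matcher code. The helpers `resolve`, `is_empty`, `field_equality` and `exact_xor_empty` are hypothetical, the field expression handling only covers plain dotted paths, and treating `None` and the empty string as "empty" is an assumption.

```python
def resolve(record, expression):
    """Resolve a plain dotted field expression such as '$.name.given'."""
    path = expression[2:] if expression.startswith("$.") else expression
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def is_empty(value):
    # Assumption: missing values, None and "" all count as empty.
    return value is None or value == ""


def field_equality(a, b, fields):
    """Every field needs a value on both sides, and the values must be equal."""
    return all(
        not is_empty(resolve(a, field)) and resolve(a, field) == resolve(b, field)
        for field in fields
    )


def exact_xor_empty(a, b, field):
    """Match if both values are equal or both are empty, never if only one is empty."""
    left, right = resolve(a, field), resolve(b, field)
    if is_empty(left) and is_empty(right):
        return True
    if is_empty(left) or is_empty(right):
        return False
    return left == right


with_dob = {"name": {"given": "Jim"}, "dateOfBirth": "1990-12-31"}
without_dob = {"name": {"given": "Jim"}}

print(field_equality(without_dob, without_dob, ["$.dateOfBirth"]))  # False
print(exact_xor_empty(without_dob, without_dob, "$.dateOfBirth"))   # True
print(exact_xor_empty(with_dob, without_dob, "$.dateOfBirth"))      # False
```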
Obviously, what this rule does could easily be done with any relational database out there. Let's increase the challenge and try to create a rule that would match these two records:
{
"name": {
"given": "Jim John",
"sur": "Smith"
},
"address": {
"zip": "10028",
"city": "New York",
"street": "5th Ave.",
"houseNumber": "1000"
},
"dateOfBirth": "1990-12-31"
}
{
"name": {
"given": "Johnn",
"sur": "Smith"
},
"address": {
"zip": "10128",
"city": "New York",
"street": "E 90th St",
"houseNumber": "2"
},
"dateOfBirth": "1990-12-31"
}
For the address, you may simply want to verify that it is still the same city. You can easily use an exact matcher for that. Matching the name is more challenging. We have a spelling mistake in John, and the first record has two given names; the token matcher is a perfect fit for that. Additionally, you may want to add the same matcher for the date of birth again; in this example we did not include it.
The following rule would match these two records.
{
"id": "R2SIMILAR",
"name": "Rule 2: phonetic token match and same city",
"matchers": [
{
"id": "matcher-3",
"type": "token",
"attributes": {
"comparison": [
{
"a": "$.name.given",
"b": "$.name.given"
},
{
"a": "$.name.sur",
"b": "$.name.sur"
}
],
"ratio": "0.3",
"phonetic": "dynamicCologneLevenshtein"
}
},
{
"id": "matcher-4",
"type": "exact",
"attributes": {
"field": "$.address.city"
}
}
]
}
For more details on these settings, please refer to the matcher documentation.
The whole configuration could now look something like this:
{
"rules": [
{
"id": "R1EXACT",
"name": "Rule 1: exact matching on all relevant fields",
"matchers": [
{
"id": "matcher-1",
"type": "fieldEquality",
"attributes": {
"fields": [
"$.name.given",
"$.name.sur",
"$.address.zip",
"$.address.city",
"$.address.street",
"$.address.houseNumber"
]
}
},
{
"id": "matcher-2",
"type": "exactXorEmpty",
"attributes": {
"field": "$.dateOfBirth"
}
}
]
},
{
"id": "R2SIMILAR",
"name": "Rule 2: phonetic token match and same city",
"matchers": [
{
"id": "matcher-3",
"type": "token",
"attributes": {
"comparison": [
{
"a": "$.name.given",
"b": "$.name.given"
},
{
"a": "$.name.sur",
"b": "$.name.sur"
}
],
"ratio": "0.3",
"phonetic": "dynamicCologneLevenshtein"
}
},
{
"id": "matcher-4",
"type": "exact",
"attributes": {
"field": "$.address.city"
}
}
]
}
],
"ruleSets": [
{
"id": "all",
"name": "All available rules.",
"rules": [
"R1EXACT",
"R2SIMILAR"
]
}
],
"searchRuleSet": "all",
"mutationRuleSetGroups": {
"default": {
"index": "all",
"dedup": "",
"match": "all"
}
}
}
Since we use the same rules for matching and searching, it is sufficient to create only one rule set and reuse it.
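Because the configuration is edited by hand, a dangling reference (for example a rule set that names a rule ID that does not exist) is an easy mistake to make. The following Python sketch checks the references between the blocks described above. It is a hypothetical helper, not a Tilores tool, and it assumes that an empty string, as used for "dedup" in the example, means "not configured".

```python
def validate_references(config):
    """Return a list of problems caused by dangling rule or rule set references."""
    problems = []
    rule_ids = {rule["id"] for rule in config.get("rules", [])}

    rule_set_ids = set()
    for rule_set in config.get("ruleSets", []):
        rule_set_ids.add(rule_set["id"])
        for rule_id in rule_set.get("rules", []):
            if rule_id not in rule_ids:
                problems.append(
                    f"rule set {rule_set['id']!r} references unknown rule {rule_id!r}"
                )

    search = config.get("searchRuleSet")
    if search not in rule_set_ids:
        problems.append(f"searchRuleSet references unknown rule set {search!r}")

    for group, usage in config.get("mutationRuleSetGroups", {}).items():
        for purpose, rule_set_id in usage.items():
            # Assumption: an empty string means the purpose is not configured.
            if rule_set_id and rule_set_id not in rule_set_ids:
                problems.append(
                    f"group {group!r}: {purpose} references unknown rule set {rule_set_id!r}"
                )
    return problems


# Reduced version of the configuration above (matchers omitted for brevity).
config = {
    "rules": [{"id": "R1EXACT"}, {"id": "R2SIMILAR"}],
    "ruleSets": [{"id": "all", "rules": ["R1EXACT", "R2SIMILAR"]}],
    "searchRuleSet": "all",
    "mutationRuleSetGroups": {"default": {"index": "all", "dedup": "", "match": "all"}},
}
print(validate_references(config))  # []
```

In practice you would load the dictionary from rule-config.json with `json.load` before running the check.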
#
Deduplication
Deduplication means recognizing that two records contain the same matching-relevant data. In this case we talk about so-called duplicates, or more precisely non-identical duplicates.
When working with a lot of data, understanding how deduplication works is crucial for your application's success.
An identical duplicate is easy to explain. If two records have the same value for all of their fields, the second record clearly does not provide any new information, and it is an identical duplicate.
Non-identical duplicates are slightly more difficult. Consider the following two records and assume that we want to use the same rules as in the previous example.
{
"id": "1",
"name": {
"given": "Jim John",
"sur": "Smith"
},
"address": {
"zip": "10028",
"city": "New York",
"street": "5th Ave.",
"houseNumber": "1000"
},
"dateOfBirth": "1990-12-31",
"someData": "this is some data"
}
{
"id": "2",
"name": {
"given": "Jim John",
"sur": "Smith"
},
"address": {
"zip": "10028",
"city": "New York",
"street": "5th Ave.",
"houseNumber": "1000"
},
"dateOfBirth": "1990-12-31",
"someData": "and this is other data"
}
Because the values for someData are clearly different, this cannot be an identical duplicate. However, all fields that we use for matching are exactly the same.
Visualizing this, we may end up with:
graph LR
  subgraph A
    1---|R1EXACT|2
    1---|R2SIMILAR|2
  end
  classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
  classDef cluster fill:#7FCEF6,stroke:#7FCEF6
  classDef edgeLabel background-color:#7FCEF6
  classDef label fill:none
While this is still somewhat understandable, let's assume that there are four of these records:
graph LR
  subgraph A
    1---|R1EXACT|3
    1---|R2SIMILAR|3
    1---|R1EXACT|2
    1---|R2SIMILAR|2
    1---|R1EXACT|4
    1---|R2SIMILAR|4
    2---|R1EXACT|3
    2---|R2SIMILAR|3
    2---|R1EXACT|4
    2---|R2SIMILAR|4
    3---|R1EXACT|4
    3---|R2SIMILAR|4
  end
  classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
  classDef cluster fill:#7FCEF6,stroke:#7FCEF6
  classDef edgeLabel background-color:#7FCEF6
  classDef label fill:none
It might already be difficult to see, but every record is connected with every other record via two edges. Just try to imagine what this would look like for 20, 100 or even 1,000 records.
But it is not just the visualization that is tricky. The number of stored edges grows as well, and each record needs to be indexed, which slows down the search. And if you query the edges via your API, all of these edges need to be transferred to your client.
The formula to calculate the number of edges for a single rule is $\frac{n^2 - n}{2}$, where $n$ is the number of non-identical duplicates. For 1,000 non-identical duplicates, this results in a total of 499,500 edges per rule!
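To get a feeling for how quickly this number grows, the formula can be evaluated for a few sizes. The function name `edges_per_rule` in this small Python sketch is hypothetical.

```python
def edges_per_rule(n):
    """Number of edges one rule creates between n mutually matching records."""
    # (n^2 - n) / 2, written so that the division is always exact.
    return n * (n - 1) // 2


for n in (2, 4, 20, 100, 1000):
    print(n, edges_per_rule(n))
# 2 1
# 4 6
# 20 190
# 100 4950
# 1000 499500
```

With the two rules from the example, the four records shown above already produce 2 × 6 = 12 edges.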
Tilores provides you with the possibility to define a rule to identify non-identical duplicates and treat them differently when it comes to indexing and edges. Don't worry though, you will still be able to find and query all stored data.
A typical deduplication rule verifies that all relevant fields match exactly. In our example, we already have such a rule defined, R1EXACT, and can assign it in our configuration accordingly.
{
"rules": [...],
"ruleSets": [
{
"id": "all",
"name": "All available rules.",
"rules": [
"R1EXACT",
"R2SIMILAR"
]
},
{
"id": "deduplication",
"name": "Only deduplication rules.",
"rules": [
"R1EXACT"
]
}
],
"searchRuleSet": "all",
"mutationRuleSetGroups": {
"default": {
"index": "all",
"dedup": "deduplication",
"match": "all"
}
}
}
The four records from before would now be stored much more efficiently:
graph LR
  subgraph A
    1-.-|DUP|2
    1-.-|DUP|3
    1-.-|DUP|4
    class 2,3,4 dup
  end
  classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
  classDef cluster fill:#7FCEF6,stroke:#7FCEF6
  classDef edgeLabel background-color:#7FCEF6
  classDef label fill:none
  classDef dup stroke-width:2px,fill:#fff,stroke:#FFE4D6,color:#aaa
When searching, you will now only receive one hit on record 1 instead of all available records, but the returned entity and the returned records of that entity will still be the same as before; no data is lost.
Furthermore, when record 1 is removed, we will recognize that and reorganize the available duplicates to have a proper original again. This is also true for more complex scenarios in which record 1 may have had other edges to other (non-duplicate) records.
#
Rule Groups
In some situations, finding a good deduplication rule can be a tough challenge.
Let's introduce a new email field on our schema:
input RecordInput {
id: ID!
name: NameInput!
address: AddressInput!
dateOfBirth: String # in the format YYYY-MM-DD
email: String
someData: String!
}
We want to keep the existing rules, which don't care about the email, but additionally we want to introduce a new rule that matches only via the email address. Again, it is up to you to decide whether such a rule makes sense or not.
{
"id": "R3EMAIL",
"name": "Rule 3: exact email address",
"matchers": [
{
"id": "matcher-5",
"type": "exact",
"attributes": {
"field": "$.email"
}
}
]
}
If we just added that new rule to our existing rule set all, we would end up with an issue. Remember that a deduplication rule should include all the matching-relevant fields, but our deduplication rule R1EXACT does not include the email field. A possible workaround would be to introduce a rule D1DEDUP that is a copy of R1EXACT and includes the email field. Unfortunately, this would reduce the number of possible duplicates, e.g. all records not providing a value for the optional email field would no longer be recognized as duplicates. The same applies to every record that provides a different email address: it would no longer be a duplicate, but would still match on R1EXACT and R2SIMILAR, possibly resulting in quadratic growth in edges.
Tilores provides you with a small but powerful feature to bypass these issues: rule groups. Rule groups let you define the rule sets for matching, indexing and deduplication independently of other groups. As a result, a record can be a duplicate in one group and not a duplicate in another group.
For our previous rules, we can change the configuration like this:
{
"rules": [...],
"ruleSets": [
{
"id": "all",
"name": "All available rules for searching.",
"rules": [
"R1EXACT",
"R2SIMILAR",
"R3EMAIL"
]
},
{
"id": "name-address",
"name": "Rules for name and address matching.",
"rules": [
"R1EXACT",
"R2SIMILAR"
]
},
{
"id": "name-address-deduplication",
"name": "Only deduplication rules.",
"rules": [
"R1EXACT"
]
},
{
"id": "email",
"name": "Rules for email matching.",
"rules": [
"R3EMAIL"
]
}
],
"searchRuleSet": "all",
"mutationRuleSetGroups": {
"g1": {
"index": "name-address",
"dedup": "name-address-deduplication",
"match": "name-address"
},
"g2": {
"index": "email",
"dedup": "email",
"match": "email"
}
}
}
Look at the following records:
{
"id": "1",
"name": {
"given": "Jim John",
"sur": "Smith"
},
"address": {
"zip": "10028",
"city": "New York",
"street": "5th Ave.",
"houseNumber": "1000"
},
"dateOfBirth": "1990-12-31",
"email": "john.smith@example.com",
"someData": "and this is other data"
}
{
"id": "2",
"name": {
"given": "Johnn",
"sur": "Smith"
},
"address": {
"zip": "10128",
"city": "New York",
"street": "E 90th St",
"houseNumber": "2"
},
"dateOfBirth": "1990-12-31",
"email": "john.smith@example.com",
"someData": "and this is other data"
}
{
"id": "3",
"name": {
"given": "Jim John",
"sur": "Smith"
},
"address": {
"zip": "10028",
"city": "New York",
"street": "5th Ave.",
"houseNumber": "1000"
},
"dateOfBirth": "1990-12-31",
"email": "j.smith@work.com",
"someData": "again some other data"
}
Record 2 is now a duplicate of record 1 in the group g2, but not in g1, while record 3 is a duplicate of record 1 in group g1, but not in g2.
We could visualize the edges and duplicates in the following way:
graph LR
  subgraph A
    1---|G1-R2SIMILAR|2
    1-.-|G2-DUP|2
    1-.-|G1-DUP|3
    2---|G1-R2SIMILAR|3
  end
  classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
  classDef cluster fill:#7FCEF6,stroke:#7FCEF6
  classDef edgeLabel background-color:#7FCEF6
  classDef label fill:none
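The per-group duplicate relationships of the three records can be reproduced with a small Python sketch. It only illustrates the idea that each group applies its own deduplication rule set; it is not how Tilores works internally. The names `pick`, `DEDUP_FIELDS`, `duplicates_of` and `person` are hypothetical, and the deduplication rules are simplified to exact equality on the fields they inspect.

```python
def pick(record, path):
    """Read a nested value using a dotted path such as 'address.city'."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


# Fields inspected by each group's deduplication rule set (simplified to equality).
DEDUP_FIELDS = {
    "g1": ["name.given", "name.sur", "address.zip", "address.city",
           "address.street", "address.houseNumber", "dateOfBirth"],
    "g2": ["email"],
}


def duplicates_of(original, candidates, group):
    """Return the IDs of the candidates that duplicate original within one group."""
    fields = DEDUP_FIELDS[group]
    key = [pick(original, field) for field in fields]
    return [c["id"] for c in candidates if [pick(c, field) for field in fields] == key]


def person(record_id, given, zip_code, street, house_number, email):
    """Build one of the three example records (values shared by all are hard-coded)."""
    return {
        "id": record_id,
        "name": {"given": given, "sur": "Smith"},
        "address": {"zip": zip_code, "city": "New York",
                    "street": street, "houseNumber": house_number},
        "dateOfBirth": "1990-12-31",
        "email": email,
    }


record_1 = person("1", "Jim John", "10028", "5th Ave.", "1000", "john.smith@example.com")
record_2 = person("2", "Johnn", "10128", "E 90th St", "2", "john.smith@example.com")
record_3 = person("3", "Jim John", "10028", "5th Ave.", "1000", "j.smith@work.com")

for group in ("g1", "g2"):
    print(group, duplicates_of(record_1, [record_2, record_3], group))
# g1 ['3']
# g2 ['2']
```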
To summarize, thinking about the right rules should be your highest priority. Unfortunately, we can only give you a rough idea here of how the configuration works, and we encourage you to experiment as much as possible with different rules.