# Rules Configuration

Rules are the most crucial part of any Tilores installation. They define how records are connected into entities and also how you can search through your existing entities.

Ideally, you already have an idea of how you want to match your records. If not, we strongly advise you to familiarize yourself with the concepts of matching and deduplication and to get an idea of what is possible with Tilores. Afterwards, you should spend a good amount of time figuring out the rules that match your needs.

# The Basics

Before we can start with the actual configuration, we have to reach a common understanding about the wording.

# Matching

We talk about matching when we compare two records with each other to determine whether they belong to the same entity. The result of this comparison is that the records either match or do not match; there is no uncertainty (although you may be able to define thresholds). Matching happens during the so-called assembly process, which runs after records have been submitted to your Tilores instance and need to be assigned to an entity.

The matching can happen either between two records that have been submitted together or between one record that has been submitted and another record that is already stored in Tilores.

The assembly process typically performs multiple matchings. When a list of records is submitted, each record from that list is compared with all other records from the same list. Afterwards, each record from that list is compared with all records already stored in Tilores. If the second comparison yields at least one match, the submitted records are attached to the matching entity; if they match records from several entities, those entities are merged.
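The two comparison steps can be sketched in a few lines of Python. This is only a logical model under stated assumptions, not how Tilores is implemented: `assemble` and the `matches` predicate are hypothetical names, and `matches` stands in for whatever rules you configure.

```python
from itertools import combinations


def assemble(submitted, stored, matches):
    """Return the pairs of record IDs that match during one assembly run.

    `matches(a, b)` is a hypothetical predicate standing in for the rules.
    """
    pairs = []
    # Step 1: compare each submitted record with every other submitted record.
    for a, b in combinations(submitted, 2):
        if matches(a, b):
            pairs.append((a["id"], b["id"]))
    # Step 2: compare each submitted record with every stored record.
    for a in submitted:
        for b in stored:
            if matches(a, b):
                pairs.append((a["id"], b["id"]))
    return pairs


def same_email(a, b):
    return a["email"] == b["email"]


submitted = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": "b@example.com"},
]
stored = [{"id": "4", "email": "b@example.com"}]
print(assemble(submitted, stored, same_email))  # [('2', '4')]
```

Here record 2 matches the stored record 4, so the submitted records would join the entity that record 4 belongs to.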

# Edges

You can think of Tilores as a huge undirected graph, where each record represents one node in that graph. Each pair of records that match can be represented as an edge in that graph. Each connected component of that graph then represents an entity.

Figure: possible visualization of the graph and its components (the entities).

An edge in Tilores is represented in the following form:

<record-id>:<other-record-id>:<label>

for example:

6373d7c6-c6bb-4a6f-8275-1dcc12b2f4a7:e4e30d6e-31d2-4b4f-b472-73e5b52c6250:R1EXACT

There are two kinds of edges - static edges and rule based edges.

Static edges are created between all records that were submitted together. E.g. submitting the records 1, 2 and 3 together will create the following edges:

1:2:STATIC
1:3:STATIC
2:3:STATIC
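The static edges listed above are simply all unordered pairs of the submitted records. A minimal sketch, where `static_edges` is a hypothetical helper and not part of Tilores:

```python
from itertools import combinations


def static_edges(record_ids):
    """Build one STATIC edge for every unordered pair of submitted records."""
    return [f"{a}:{b}:STATIC" for a, b in combinations(record_ids, 2)]


print(static_edges(["1", "2", "3"]))
# ['1:2:STATIC', '1:3:STATIC', '2:3:STATIC']
```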

Rule-based edges, on the other hand, are created whenever a match between two records is found. It is also possible to have multiple edges between the same two records when multiple rules match.

E.g. using the previous records, it would be possible to have matches between 1 and 3 and also between 2 and an already stored record 4.

The resulting edges could be:

1:2:STATIC
1:3:STATIC
2:3:STATIC
1:3:R1EXACT
2:4:R1EXACT
2:4:R2EXACT

For an easier understanding we will use the following visual representation instead of the textual one.

graph LR
subgraph A
  1---|STATIC|2
  1---|STATIC|3
  2---|R1EXACT|4
  2---|R2EXACT|4
  2---|STATIC|3
  1---|R1EXACT|3
end
classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
classDef cluster fill:#7FCEF6,stroke:#7FCEF6
classDef edgeLabel background-color:#7FCEF6
classDef label fill:none
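To make the link between edges and entities concrete, the following sketch parses edges in the textual `<record-id>:<other-record-id>:<label>` form and groups the records into connected components with a small union-find. The function name `entities` and the extra edge `5:6:R1EXACT` are made up for illustration, and the sketch assumes that neither IDs nor labels contain a colon.

```python
def entities(edges):
    """Group record IDs into connected components (entities)."""
    parent = {}

    def find(x):
        # Return the representative of x's component, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge in edges:
        a, b, _label = edge.split(":")
        parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())


edges = [
    "1:2:STATIC", "1:3:STATIC", "2:3:STATIC",
    "1:3:R1EXACT", "2:4:R1EXACT", "2:4:R2EXACT",
    "5:6:R1EXACT",  # hypothetical second entity
]
print(entities(edges))  # [['1', '2', '3', '4'], ['5', '6']]
```

Multiple edges between the same two records do not change the result; only connectivity matters for the entity.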

# Searching

We talk about searching when we compare the input search parameters with all records that have already been submitted to Tilores. It is comparable to matching, except that the search parameters may have a different structure than the records.

Where the result of matching is a set of edges, the result of searching is a set of hits. They serve a similar purpose, but are structured slightly differently for easier use:

{
  "2": ["R1EXACT", "R2EXACT"],
  "3": ["R1EXACT"]
}

In this example, the search found the records 2 and 3 and lists the rules that were satisfied.

# Indexing

Indexing means making parts of a record available for searching and matching. Generally speaking, every rule that you want to use for either of these two must also be added to the index. Otherwise it will not yield any results.

# Default Configuration

The default configuration is stored in your Tilores project folder under rule-config.json. Let's have a look at the individual parts of a configuration file.

The rules block is a list of rules. Each rule is built from multiple matchers. For a rule to be satisfied, all of its matchers must be satisfied. Technically speaking, they are AND connected.

The ruleSets block is a list of rule sets. Rule sets bundle multiple rules for a specific use, e.g. for the search. For a rule set to be satisfied at least one of its rules must be satisfied. Technically speaking they are OR connected.

The searchRuleSet defines which of the previously defined rule sets should be used for searching. Currently only one rule set can be used for searching - let us know about your use case if you need more than that.

The mutationRuleSetGroups defines at least one group and the rule sets to use for matching, indexing and optionally for deduplication. Later we will show you how to use the default group, but also under which circumstances you may want to define multiple groups.

The default configuration comes with exactly one rule. That rule is satisfied if the myCustomField from two records (matching) or from a record and the search parameters (search) are exactly the same. This is the most simple configuration possible.
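The AND/OR semantics of rules and rule sets can be illustrated with a short sketch. All names here (`rule_satisfied`, `rule_set_labels`, `exact`, and the rule IDs `R1` and `R2`) are hypothetical, and the `exact` helper is only a simplified stand-in for a real matcher.

```python
def rule_satisfied(matchers, a, b):
    # A rule is satisfied only if ALL of its matchers are satisfied (AND).
    return all(matcher(a, b) for matcher in matchers)


def rule_set_labels(rule_set, rules, a, b):
    # A rule set is satisfied if AT LEAST ONE of its rules is satisfied (OR).
    # The IDs of the satisfied rules are returned; an empty list means no match.
    return [rule_id for rule_id in rule_set if rule_satisfied(rules[rule_id], a, b)]


def exact(field):
    # Simplified stand-in for an exact matcher on one top-level field.
    return lambda a, b: a.get(field) is not None and a.get(field) == b.get(field)


rules = {
    "R1": [exact("myCustomField")],
    "R2": [exact("myCustomField"), exact("other")],
}
record = {"myCustomField": "abc", "other": "x"}
params = {"myCustomField": "abc", "other": "y"}
print(rule_set_labels(["R1", "R2"], rules, record, params))  # ['R1']
```

`R2` fails because one of its two matchers fails, while the rule set as a whole is still satisfied through `R1`.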

# Example: Customize Configuration

Let's create a custom rules configuration. For this example we are going to work with the following schema:

input NameInput {
  given: String!
  sur: String!
}

input AddressInput {
  zip: String!
  city: String!
  street: String!
  houseNumber: String!
}

input RecordInput {
  id: ID!
  name: NameInput!
  address: AddressInput!
  dateOfBirth: String # in the format YYYY-MM-DD
  someData: String!
}

input SearchParams {
  name: NameInput!
  address: AddressInput!
  dateOfBirth: String # in the format YYYY-MM-DD
}

If it is not clear yet, the entity we want to create is a person. To keep things simple, we did not specify what exactly someData is. It could be financial data, e-commerce orders, contracts, or whatever else makes sense. Of course it could also be structured much further, but a string is sufficient here, as it is relevant for neither searching nor matching.

To simplify things here, we define the matching- and searching-relevant fields of RecordInput and SearchParams in the same way. This allows us to reuse the rules, but it is not necessary.

The first rule we want to create is to match when all the relevant fields are equal. For these comparisons we can use the fieldEquality matcher:

{
  "id": "R1EXACT",
  "name": "Rule 1: exact matching on all relevant fields",
  "matchers": [
    {
      "id": "matcher-1",
      "type": "fieldEquality",
      "attributes": {
        "fields": [
          "$.name.given",
          "$.name.sur",
          "$.address.zip",
          "$.address.city",
          "$.address.street",
          "$.address.houseNumber",
          "$.dateOfBirth"
        ]
      }
    }
  ]
}

Let's break it down. Each rule must have a unique ID, in this case R1EXACT. This ID will be the label of an edge or shown in the hits. You can use this for example for quality measurements.

Each rule also should have a name. This is currently only for you to understand what this rule is supposed to do, but we may later show this in a UI.

Then, as explained previously, each rule must have at least one matcher, each with a unique ID and a type. Since we only want to ensure that each field is equal, we can use the fieldEquality matcher here. It accepts a list of fields and, for each field, compares the actual values from both records or from the record and the search parameters.

A field is referred to using a field expression.

However, we now have one small issue. The fieldEquality matcher requires a value for every single field. Even if both parts that need to be compared are empty, the matcher still will not match. In our case that might happen for the optional field dateOfBirth.

Instead of making this a required field or ignoring it completely, we introduce a second matcher for the same rule, the exactXorEmpty matcher. Our configuration now looks like this:

{
  "id": "R1EXACT",
  "name": "Rule 1: exact matching on all relevant fields",
  "matchers": [
    {
      "id": "matcher-1",
      "type": "fieldEquality",
      "attributes": {
        "fields": [
          "$.name.given",
          "$.name.sur",
          "$.address.zip",
          "$.address.city",
          "$.address.street",
          "$.address.houseNumber"
        ]
      }
    },
    {
      "id": "matcher-2",
      "type": "exactXorEmpty",
      "attributes": {
        "field": "$.dateOfBirth"
      }
    }
  ]
}

The exactXorEmpty matcher will match if either both values are equal or both values are empty, but never if only one of the values is empty.
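The difference between the two matchers, as described in this section, can be summarized in a small truth-table sketch. The functions below are illustrative re-implementations, not Tilores code, and treating both `None` and `""` as empty is an assumption.

```python
def is_empty(value):
    # Assumption: a missing value (None) and an empty string both count as empty.
    return value is None or value == ""


def field_equality(a, b):
    # Both values must be present and equal; two empty values do not match.
    return not is_empty(a) and not is_empty(b) and a == b


def exact_xor_empty(a, b):
    # Matches if both values are equal or both are empty,
    # but never if only one of them is empty.
    if is_empty(a) and is_empty(b):
        return True
    if is_empty(a) or is_empty(b):
        return False
    return a == b


cases = [
    ("1990-12-31", "1990-12-31"),
    ("1990-12-31", "1991-01-01"),
    ("1990-12-31", None),
    (None, None),
]
for a, b in cases:
    print(a, b, field_equality(a, b), exact_xor_empty(a, b))
```

The two functions only disagree on the last case, where both values are empty.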

Obviously, what this rule does could easily be done with any relational database. Let's increase the challenge and try to create a rule that would match these two records:

Record 1 (first object) and Record 2 (second object):
{
  "name": {
    "given": "Jim John",
    "sur": "Smith"
  },
  "address": {
    "zip": "10028",
    "city": "New York",
    "street": "5th Ave.",
    "houseNumber": "1000"
  },
  "dateOfBirth": "1990-12-31"
}
{
  "name": {
    "given": "Johnn",
    "sur": "Smith"
  },
  "address": {
    "zip": "10128",
    "city": "New York",
    "street": "E 90th St",
    "houseNumber": "2"
  },
  "dateOfBirth": "1990-12-31"
}

For the address, you may simply want to verify that it is still the same city. You can easily use an exact matcher for that. Matching the name is more challenging. We have a spelling mistake in John, and the first record has two given names. The token matcher is a perfect fit for that. Additionally, you may want to add the same matcher for the date of birth again; in this example we did not include it.

The following rule would match these two records.

{
  "id": "R2SIMILAR",
  "name": "Rule 2: phonetic token match and same city",
  "matchers": [
    {
      "id": "matcher-3",
      "type": "token",
      "attributes": {
        "comparison": [
          {
            "a": "$.name.given",
            "b": "$.name.given"
          },
          {
            "a": "$.name.sur",
            "b": "$.name.sur"
          }
        ],
        "ratio": "0.3",
        "phonetic": "dynamicCologneLevenshtein"
      }
    },
    {
      "id": "matcher-4",
      "type": "exact",
      "attributes": {
        "field": "$.address.city"
      }
    }
  ]
}

For more details on these settings, please refer to the matcher documentation.

The whole configuration could now look something like this:

{
  "rules": [
    {
      "id": "R1EXACT",
      "name": "Rule 1: exact matching on all relevant fields",
      "matchers": [
        {
          "id": "matcher-1",
          "type": "fieldEquality",
          "attributes": {
            "fields": [
              "$.name.given",
              "$.name.sur",
              "$.address.zip",
              "$.address.city",
              "$.address.street",
              "$.address.houseNumber"
            ]
          }
        },
        {
          "id": "matcher-2",
          "type": "exactXorEmpty",
          "attributes": {
            "field": "$.dateOfBirth"
          }
        }
      ]
    },
    {
      "id": "R2SIMILAR",
      "name": "Rule 2: phonetic token match and same city",
      "matchers": [
        {
          "id": "matcher-3",
          "type": "token",
          "attributes": {
            "comparison": [
              {
                "a": "$.name.given",
                "b": "$.name.given"
              },
              {
                "a": "$.name.sur",
                "b": "$.name.sur"
              }
            ],
            "ratio": "0.3",
            "phonetic": "dynamicCologneLevenshtein"
          }
        },
        {
          "id": "matcher-4",
          "type": "exact",
          "attributes": {
            "field": "$.address.city"
          }
        }
      ]
    }
  ],
  "ruleSets": [
    {
      "id": "all",
      "name": "All available rules.",
      "rules": [
        "R1EXACT",
        "R2SIMILAR"
      ]
    }
  ],
  "searchRuleSet": "all",
  "mutationRuleSetGroups": {
    "default": {
      "index": "all",
      "dedup": "",
      "match": "all"
    }
  }
}

Since we use the same rules for matching and searching, it is sufficient to create only one rule set and reuse this.
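Because rule sets, the search rule set and the groups all refer to each other by ID, a typo can easily go unnoticed. The following sketch checks a configuration for dangling references. It is a hypothetical helper, not a Tilores tool, and it assumes that an empty string (as used for dedup above) means "not configured".

```python
def validate(config):
    """Return human-readable problems for dangling references in a rule config."""
    problems = []
    rule_ids = {rule["id"] for rule in config["rules"]}
    set_ids = {rule_set["id"] for rule_set in config["ruleSets"]}

    # Every rule listed in a rule set must be defined in "rules".
    for rule_set in config["ruleSets"]:
        for rule_id in rule_set["rules"]:
            if rule_id not in rule_ids:
                problems.append(f"rule set {rule_set['id']}: unknown rule {rule_id}")

    # The search rule set must exist.
    if config["searchRuleSet"] not in set_ids:
        problems.append(f"searchRuleSet: unknown rule set {config['searchRuleSet']}")

    # Each group references rule sets for index, dedup and match.
    for group, usage in config["mutationRuleSetGroups"].items():
        for purpose, set_id in usage.items():
            # Assumption: an empty string means "not configured".
            if set_id and set_id not in set_ids:
                problems.append(f"group {group}.{purpose}: unknown rule set {set_id}")
    return problems


config = {
    "rules": [{"id": "R1EXACT"}, {"id": "R2SIMILAR"}],
    "ruleSets": [{"id": "all", "rules": ["R1EXACT", "R2SIMILAR"]}],
    "searchRuleSet": "all",
    "mutationRuleSetGroups": {
        "default": {"index": "all", "dedup": "", "match": "al"},  # typo on purpose
    },
}
print(validate(config))  # ['group default.match: unknown rule set al']
```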

# Deduplication

Deduplication means to recognize that two records contain the same matching relevant data. In this case we talk about so called duplicates, more precisely non-identical duplicates.

An identical duplicate is easy to explain. If two records have the same value for all their fields and it clearly does not provide any new information, then this is an identical duplicate.

For non-identical duplicates it is slightly more difficult. Consider the following two records and presume, that we want to use the same rules from the previous example.

Record 1 (first object) and Record 2 (second object):
{
  "id": "1",
  "name": {
    "given": "Jim John",
    "sur": "Smith"
  },
  "address": {
    "zip": "10028",
    "city": "New York",
    "street": "5th Ave.",
    "houseNumber": "1000"
  },
  "dateOfBirth": "1990-12-31",
  "someData": "this is some data"
}
{
  "id": "2",
  "name": {
    "given": "Jim John",
    "sur": "Smith"
  },
  "address": {
    "zip": "10028",
    "city": "New York",
    "street": "5th Ave.",
    "houseNumber": "1000"
  },
  "dateOfBirth": "1990-12-31",
  "someData": "and this is other data"
}

Because the values for someData are clearly different, this cannot be an identical duplicate. However, all fields that we use for matching are exactly the same.

Visualizing this, we may end up with:

graph LR
subgraph A
  1---|R1EXACT|2
  1---|R2SIMILAR|2
end
classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
classDef cluster fill:#7FCEF6,stroke:#7FCEF6
classDef edgeLabel background-color:#7FCEF6
classDef label fill:none

While this is still somewhat understandable, let's assume that there are four of these records:

graph LR
subgraph A
  1---|R1EXACT|3
  1---|R2SIMILAR|3
  1---|R1EXACT|2
  1---|R2SIMILAR|2
  1---|R1EXACT|4
  1---|R2SIMILAR|4
  2---|R1EXACT|3
  2---|R2SIMILAR|3
  2---|R1EXACT|4
  2---|R2SIMILAR|4
  3---|R1EXACT|4
  3---|R2SIMILAR|4
end
classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
classDef cluster fill:#7FCEF6,stroke:#7FCEF6
classDef edgeLabel background-color:#7FCEF6
classDef label fill:none

It might already be difficult to see, but every record is connected with every other record via two edges. Just try to imagine what this would look like for 20, 100 or even 1,000 records.

But it is not just the visualization that is tricky. All of these edges have to be stored, and each record needs to be indexed, which would slow down the search. And if you query the edges via your API, all of them need to be transferred to your client.

The formula to calculate the number of edges for a single rule is $\frac{n^2 - n}{2}$, where $n$ is the number of non-identical duplicates. For 1,000 non-identical duplicates, this would result in a total of 499,500 edges per rule!
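To get a feeling for how quickly this grows, the formula can be evaluated for a few values of n. The helper name `edges_per_rule` is made up for this sketch.

```python
def edges_per_rule(n):
    """Number of edges between n mutually matching records for one rule."""
    return (n * n - n) // 2


for n in (4, 20, 100, 1000):
    print(n, edges_per_rule(n))
# 4 6
# 20 190
# 100 4950
# 1000 499500
```

For the four records in the graph above this gives 6 edges per rule, so 12 edges for two rules.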

Tilores provides you with the possibility to define a rule to identify non-identical duplicates and treat them differently when it comes to indexing and edges. Don't worry though, you will still be able to find and query all stored data.

A typical deduplication rule verifies that all relevant fields must match exactly. In our example, we already have such a rule defined: R1EXACT and can assign it in our configuration accordingly.

{
  "rules": [...],
  "ruleSets": [
    {
      "id": "all",
      "name": "All available rules.",
      "rules": [
        "R1EXACT",
        "R2SIMILAR"
      ]
    },
    {
      "id": "deduplication",
      "name": "Only deduplication rules.",
      "rules": [
        "R1EXACT"
      ]
    }
  ],
  "searchRuleSet": "all",
  "mutationRuleSetGroups": {
    "default": {
      "index": "all",
      "dedup": "deduplication",
      "match": "all"
    }
  }
}

The four records from before would now be stored much more efficiently:

graph LR
subgraph A
  1-.-|DUP|2
  1-.-|DUP|3
  1-.-|DUP|4
  class 2,3,4 dup
end
classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
classDef cluster fill:#7FCEF6,stroke:#7FCEF6
classDef edgeLabel background-color:#7FCEF6
classDef label fill:none
classDef dup stroke-width:2px,fill:#fff,stroke:#FFE4D6,color:#aaa

When searching, you will now only receive one hit on record 1 instead of all available records - but the returned entity and also the returned records of that entity will still be the same as before - no data is lost.

Furthermore, when record 1 is removed, we will recognize that and reorganize the available duplicates to have a proper original again. This is also true for more complex scenarios in which record 1 may have had other edges to other (non duplicate) records.

# Rule Groups

In some situations, finding a good deduplication rule can be a tough challenge.

Let's introduce a new email field on our schema:

input RecordInput {
  id: ID!
  name: NameInput!
  address: AddressInput!
  dateOfBirth: String # in the format YYYY-MM-DD
  email: String
  someData: String!
}

We want to keep the existing rules, which ignore the email, and additionally introduce a new rule that matches only on the email address. Again, it is up to you to decide whether such a rule makes sense or not.

{
  "id": "R3EMAIL",
  "name": "Rule 3: exact email address",
  "matchers": [
    {
      "id": "matcher-5",
      "type": "exact",
      "attributes": {
        "field": "$.email"
      }
    }
  ]
}

If we just added that new rule to our existing rule set all, we would end up with an issue. Remember that a deduplication rule should include all the matching-relevant fields, but our deduplication rule R1EXACT does not include the email field. A possible workaround would be to introduce a rule D1DEDUP that is a copy of R1EXACT and also includes the email field. Unfortunately, this would reduce the number of possible duplicates: all records that do not provide a value for the optional email field would no longer be recognized as duplicates. The same applies to every record that provides a different email address. It would no longer be a duplicate, but would still match on R1EXACT and R2SIMILAR, possibly resulting in a quadratic growth in edges.

Tilores provides a small but powerful feature to bypass these issues: rule groups. Rule groups let you define the rule sets for matching, indexing and deduplication independently of other groups. As a result, a record can be a duplicate in one group and not a duplicate in another.

For our previous rules, we can change the configuration like this:

{
  "rules": [...],
  "ruleSets": [
    {
      "id": "all",
      "name": "All available rules for searching.",
      "rules": [
        "R1EXACT",
        "R2SIMILAR",
        "R3EMAIL"
      ]
    },
    {
      "id": "name-address",
      "name": "Rules for name and address matching.",
      "rules": [
        "R1EXACT",
        "R2SIMILAR"
      ]
    },
    {
      "id": "name-address-deduplication",
      "name": "Only deduplication rules.",
      "rules": [
        "R1EXACT"
      ]
    },
    {
      "id": "email",
      "name": "Rules for email matching.",
      "rules": [
        "R3EMAIL"
      ]
    }
  ],
  "searchRuleSet": "all",
  "mutationRuleSetGroups": {
    "g1": {
      "index": "name-address",
      "dedup": "name-address-deduplication",
      "match": "name-address"
    },
    "g2": {
      "index": "email",
      "dedup": "email",
      "match": "email"
    }
  }
}

Look at the following records:

Record 1, Record 2 and Record 3 (in that order):
{
  "id": "1",
  "name": {
    "given": "Jim John",
    "sur": "Smith"
  },
  "address": {
    "zip": "10028",
    "city": "New York",
    "street": "5th Ave.",
    "houseNumber": "1000"
  },
  "dateOfBirth": "1990-12-31",
  "email": "john.smith@example.com",
  "someData": "and this is other data"
}
{
  "id": "2",
  "name": {
    "given": "Johnn",
    "sur": "Smith"
  },
  "address": {
    "zip": "10128",
    "city": "New York",
    "street": "E 90th St",
    "houseNumber": "2"
  },
  "dateOfBirth": "1990-12-31",
  "email": "john.smith@example.com",
  "someData": "and this is other data"
}
{
  "id": "3",
  "name": {
    "given": "Jim John",
    "sur": "Smith"
  },
  "address": {
    "zip": "10028",
    "city": "New York",
    "street": "5th Ave.",
    "houseNumber": "1000"
  },
  "dateOfBirth": "1990-12-31",
  "email": "j.smith@work.com",
  "someData": "again some other data"
}

Record 2 is now a duplicate of record 1 in group g2, but not in g1, while record 3 is a duplicate of record 1 in group g1, but not in g2.
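The per-group duplicate detection for these three records can be reproduced with a simplified model. This sketch is an assumption-laden illustration and not Tilores code: the records are flattened, the key functions mimic the deduplication rule sets of g1 and g2, and treating the first record seen as the original is a simplification.

```python
def dedup_key_g1(r):
    # Group g1: all name, address and date-of-birth fields (as in R1EXACT).
    return (r["given"], r["sur"], r["zip"], r["city"],
            r["street"], r["houseNumber"], r["dateOfBirth"])


def dedup_key_g2(r):
    # Group g2: only the email address (as in R3EMAIL).
    return (r["email"],)


def duplicates(records, key):
    """Map each duplicate's ID to the ID of the first record with the same key."""
    originals, dups = {}, {}
    for r in records:
        k = key(r)
        if k in originals:
            dups[r["id"]] = originals[k]
        else:
            originals[k] = r["id"]
    return dups


base = {"sur": "Smith", "city": "New York", "dateOfBirth": "1990-12-31"}
records = [
    {**base, "id": "1", "given": "Jim John", "zip": "10028", "street": "5th Ave.",
     "houseNumber": "1000", "email": "john.smith@example.com"},
    {**base, "id": "2", "given": "Johnn", "zip": "10128", "street": "E 90th St",
     "houseNumber": "2", "email": "john.smith@example.com"},
    {**base, "id": "3", "given": "Jim John", "zip": "10028", "street": "5th Ave.",
     "houseNumber": "1000", "email": "j.smith@work.com"},
]
print(duplicates(records, dedup_key_g1))  # {'3': '1'}
print(duplicates(records, dedup_key_g2))  # {'2': '1'}
```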

We could visualize the edges and duplicates in the following way:

graph LR
subgraph A
  1---|G1-R2SIMILAR|2
  1-.-|G2-DUP|2
  1-.-|G1-DUP|3
  2---|G1-R2SIMILAR|3
end
classDef node stroke-width:2px,fill:#fff,stroke:#F6A77F
classDef cluster fill:#7FCEF6,stroke:#7FCEF6
classDef edgeLabel background-color:#7FCEF6
classDef label fill:none

To summarize, thinking about the right rules should be your highest priority. Unfortunately, we can only give you a rough idea here of how the configuration works, and we encourage you to play around with different rules as much as possible.