# Frequently Asked Questions

# General

# Are entity IDs stable in Tilores?

Tilores strives to keep entity IDs as stable as possible. However, due to the inherent complexity of entity resolution, entity IDs may change over time. This behavior is common across all entity resolution systems.

When a new record does not match any existing entity, Tilores creates a new entity ID.

When a new record matches exactly one existing entity, the record is added and the entity ID remains unchanged.

When a new record matches multiple existing entities, Tilores retains the entity ID of the oldest entity (based on creation time). If creation times are equal, the largest entity keeps its ID.

When records or edges are removed but the entity remains connected, the entity ID is retained.

When removals cause an entity to split into multiple entities, the largest resulting entity retains the original ID, while others receive new IDs.

When the last record of an entity is removed, its entity ID is deleted. Re-ingesting the same record will not restore the previous entity ID.

# Is data from multiple customers stored together in Tilores?

No. Each customer receives an isolated Tilores instance with its own AWS resources, including DynamoDB, S3, and Lambda functions. No customer data is shared across instances.

# Rule Configuration

# How are previously ingested records affected when new matching rules are added?

Existing data is not updated automatically. New data will follow the new matching rules. To apply the new rules to existing data, re-ingest the same records using their original record IDs.

# What is deduplication?

In general, deduplication is another term for entity resolution. In the context of Tilores, it refers to a specific set of rules designed to identify records that share identical matching-relevant attributes while differing in others. These are called non-identical duplicates, and the first processed record with the shared information is considered the original.

Tilores uses deduplication mainly for performance optimization. Because duplicates contain the same matching-relevant data as the original, certain operations can be skipped and data volume can be reduced.

For even higher performance gains, partial deduplication is available through rule groups.

As with all entity records, duplicates remain fully accessible at query time. For example, querying the first names of an entity containing one original and three duplicates will return all four identical first-name values.

# API

# What is the reason for choosing a GraphQL API?

A GraphQL API offers maximum flexibility for querying complex data. Since entity resolution deals with large volumes of often duplicated data, GraphQL enables precise aggregation, filtering, and grouping of attributes tailored to specific needs.

The API includes comprehensive, self-contained documentation that can be queried directly without additional tools.

Its HTTP-based protocol keeps implementation effort similar to REST APIs. Below is an example request body that can be sent to the GraphQL endpoint via a standard HTTP POST request:

{
  "query": "query MySearch($search: SearchParams!) {search(input:{parameters:$search}){entities {id}}}",
  "variables": {
    "search": {
      "myCustomField": "some value"
    }
  }
}

The example above uses variables. Although it is possible to hard-code the myCustomField parameter directly in the query, this approach is discouraged because weak parameter validation on the client-side could allow query injection attacks.

# How can I retrieve metadata like creation timestamps?

Tilores’ GraphQL schema is flexible, letting you define your data structure to fit your use case. This includes custom metadata fields, such as creation timestamps or data source references. All fields are accessible in search results, either directly or in aggregated form.

# How can multiple records be submitted in one request?

Tilores does not currently support submitting multiple independent records in a single request. While this could theoretically be done using GraphQL aliases, it is not recommended.

Instead, submit records in batches using concurrent requests. The serverless endpoint scales automatically to handle the load. The batch-graphql tool can process a JSON lines file and submit records concurrently, and it can also be used for batch querying of resulting entities.

# Why can I submit a list of records even though batch submissions are not supported?

In most cases, the list contains only a single record. However, there are situations where you know multiple records belong to the same entity even if they do not share common data.

For instance, when building a person entity, two records might have different surnames and addresses, but domain knowledge indicates they represent the same individual, such as after a marriage. In such cases, both records can be submitted in a single request and will be statically linked, even without other matching information.

# Why does it take a long time for my ingested data to be fully processed?

Tilores assembles records with high concurrency. When multiple records influence the same resulting entity, a pessimistic locking strategy is used: the first record for that entity is processed before the next one.

If the submitted data is ordered in a way that increases the likelihood of multiple concurrent records belonging to the same entity, concurrency drops and processing may become effectively serial. This commonly happens when data comes from an existing system or is sorted by certain attributes.

Shuffling the records before ingestion usually restores high concurrency and significantly improves processing performance.

# Deployment

The following questions apply only to the self-hosted version of Tilores. For the managed version, the correct settings are automatically configured.

# How can the optimal configuration for an initial data import be determined?

A single record usually takes 200–500ms to be assembled in a Lambda function, allowing 2 to 5 records per second per function, depending on the matching configuration complexity.

For processing 10 million records within one hour, at least 556 concurrent Lambda functions are required:

10,000,000 / 3600s / 5 ≈ 556

There is no strict limit on concurrent Lambda functions. The maximum depends on the chosen raw data queue type and its configuration.

With SQS (default), use this value directly for assemble_parallelization_sqs.

With Kinesis, set assemble_parallelization_factor to 10 and rawdata_stream_shard_count to 56, resulting in up to 560 concurrent assemble Lambda functions.

For most scenarios, SQS is the recommended deployment option. Kinesis incurs hourly costs per shard regardless of usage.

Ensure your AWS account has sufficient quota for Lambda concurrent executions.

# What causes high levels of throttling in DynamoDB or S3?

When a DynamoDB table or S3 bucket is first created, it uses default partitioning, which supports only a limited data throughput. AWS automatically scales resources over time as needed.

Repartitioning in DynamoDB usually takes only a few minutes. Refer to the default table and account quotas.

Repartitioning in S3 can take longer, depending on the number of objects, and may take several hours.