Understanding the Structure of Data in Reltio Analytics

You can learn about the data structure in Reltio Analytics.

Once the data is delivered, two datasets for the tenant are generated. The name of the datasets are as follows:

  • Project id: <to be specific per customer>
  • Datasets
    • views_riq_dw_<env>_<tenant> (these are the materialized views)
    • riq_dw_<env>_<tenant> (this contains raw data)

Understanding Reltio Analytics Table Field Names

The tables created in Reltio Analytics (through batch and streaming) use names of entities, relations, interactions, matches, and merges as fields. For more information, see the example below.

The following views are delivered into the views dataset:

Table 1. Analytics View Name
Reltio Attribute
view_entities_crosswalks_count
view_entities_crosswalks_count_no_date_crosswalks_source
view_entities_crosswalks_count_no_date_entity_type
view_entities_crosswalks_created_metadata
view_entities_crosswalks_deleted_metadata
view_entities_crosswalks_updated_metadata
view_entities_metadata
view_matches_source
view_merges_source
view_merges_sources_pivot
view_no_matches_source_count
view_relations_metadata
view_relations_orphans
view_workflow_entityType

Data Model Checklist for Reltio Analytics

Reltio tenant attribute names may require some adjustments or transformation before they are delivered to the Reltio Analytics environment. This is due to a number of constraints around how table names and fields are maintained in the target environment. To achieve consistency, Reltio Analytics applies the following automatic transformation rules.

Automatic Transformation Rules

Following are the rules related to automatic transformation applied by Reltio Analytics:

  1. If the Reltio tenant attribute name meets the Reltio Analytics schema naming convention, the Analytics' Reltio attribute remains unchanged.
  2. If the Reltio tenant attribute name contains invalid target characters, replace them with an underscore (_).
  3. If the Reltio tenant attribute name starts with a digit, include an underscore (_) in the beginning of the Analytics' Reltio attribute name.
  4. If the Reltio tenant attribute meets the naming collisions after the previous two steps, include a CRC32 hash function value at the end of the new Analytics' Reltio attribute name.
Note: Each column can include an optional description. The description is a string and is trimmed to a maximum length of 1024 characters.
Table 2. Examples
Reltio Tenant Attribute Name Analytics Field Name (Transformed as per the Rules)
MyField MyField
My-Field My_Field
ISO3166-2 ISO3166_2
Another-Field Another_Field_71e1316
another_field another_field_45b4f955

Other Considerations or Best Practices

Following are some considerations or best practices related to Reltio Analytics:

  • The data to the Reltio Analytics environment must be re-exported through a batch job if the following changes are made to the L3 attributes:
    • If the type and uri attributes are changed - this is necessary to reflect the new field names in the Reltio Analytics datasets.
    • If the skipInDataAccess attribute is changed from false to true or vice versa; (if skipInDataAccess is false then it is not considered for Analytics Export) - this is necessary to bring the attributes into Reltio Analytics that were previously skipped.
  • Once the initial dataset has been delivered, streaming can be turned on. We do not recommend turning on streaming before delivering the initial dataset, as it causes the events to be delivered into the empty tables making them useless.
  • Streaming can be turned off anytime. To bring the data back into sync, we recommend you to perform the batch data export again and then turn on streaming.

How does data land in Reltio Analytics?

Initially, a batch sync is performed and then streaming is turned on for continuous updates.

How is data organized in Reltio Analytics?

Data is streamed in real-time from tenants to the Reltio Analytics projects. The projects can either be generic or customer specific. For example, the project reltio-analytics-<customername> depends on customer requirements.

Regardless of the project structure, the data in a dataset is named after a combination of the following details:
  • Type of the tenant such as test, prod, dev, and so on
  • Tenant ID
For example, riq_dw_<env>_<tenantid> is a dataset that contains all the raw data from the tenant directly and riq_dw_views_<env>_<tenantid> is a dataset that primarily contains the necessary views that are required for out-of-the-box reports.
Inside the dataset there are few tables that come out-of-the-box for each tenant. These tables contain data about specific entity type and are named using the formula - entity_<Entity_Type>_source. The data from the tenant lands in the following tables:
  • entity_Contact_source
  • entity_Individual_source
  • entity_Location_source
  • entity_Organization_source
Furthermore, each tenant comes with some relations as well. These relations also come with out-of-the-box tables to store them in Reltio Analytics. The tables are named using the formula - relation_<Relation_Type>_source. You can view the following relations for your tenant:
  • Organization to Contact
  • Individual to Contact
  • Contact to Location
  • Organization to Location
Note: For each of these relations there is a separate table in Reltio Analytics. The columns of each table are mapped according to the L3 configuration of your tenant.

L3 to Reltio Analytics Table Mapping for Entities

Table name comes from the entity URI. For example, "uri": "configuration/entityTypes/Organization" results in a table called, entity_Organization_source.

Each row of the table represents a single entity of that given type. Since each entity has at least one but most likely multiple crosswalks, each table has a crosswalk column. In Reltio each crosswalk is considered as an object. Therefore, the crosswalk object is mapped to STRUCT in Reltio Analytics. Furthermore, as the crosswalks can be more than one for each entity or each row in our Reltio Analytics table, the column is also REPEATED.

The array of crosswalk objects in Reltio is mapped to REPEATED STRUCT in Reltio Analytics. The STRUCT has the following properties of a crosswalk as sub-columns:
  • values
  • attributes
  • uri
  • updateDate
  • deleteDate
  • reltioLoadDate
  • source
  • sourceTable
  • refEntityOrRelationId
  • createDate

These columns cannot be accessed directly. In order to query them you must first unnest the column. For more information, see the How to Unnest Columns section below.

Each entity table also contains the following columns:
  • tags - The column holds the tags that can be attached to the entity REPEATED column.
  • createdTime - Timestamp of the entity creation in the source system.
  • updatedTime - Timestamp of the last update of the entity in the source system.
  • startDate - The date when the entity was created in the Reltio system.
  • endDate - The date when the entity was soft deleted (if deleted) in the Reltio system.
  • uri - The URI of the entity.
  • updatedBy - The ID of the person who made the last update.
  • type - The type of the entity. Since one table holds information about one entity type, it is expected that all the records in the given table have the same value in this column.
  • createdBy - The creator of the entity.
  • eventTime - The time at which the event was logged.
  • versionTime - The version of objects in the dataframe.
  • deleted - Boolean value indicating whether the entity is deleted or not.
  • deletedTime - Timestamp of the entity’s deletion in the source system.
  • id - The ID of the entity in the Reltio system.
  • insertedTime - Timestamp of an entity inserted in the GBQ streaming table. For GBQ batch table, insertedTime is equal to the updatedTime.
Note: Each column can include an optional description. The description is a string and is trimmed to a maximum length of 1024 characters.
Each table contains the columns corresponding to the attributes defined in L3. For the attributes section there is an attribute record, which is a STRUCT holding REPEATED sub-column since attributes in Reltio are represented by arrays. For example, the Name attribute of the Organization entity to be configured in L3 is mapped to the attributes.Name column in Reltio Analytics as shown below.
{
  "label": "Name",
  "name": "Name",
  "description": "Name",
  "type": "String",
  "hidden": false,
  "important": false,
  "system": false,
  "searchable": true,
  "attributeOrdering": {
    "orderingStrategy": "LUD"
  },
  "uri": "configuration/entityTypes/Organization/attributes/Name",
  "skipInDataAccess": false
}

Therefore, the configuration/entityTypes/Organization/attributes/Name URI resolves itself around attributes.Name inside the entity_Organization_source table. You can see that the relation of the URI and the column is mapped inside Reltio Analytics.

There is no limit of the nesting levels. So, columns holding attribute values in Reltio Analytics such as attributes.Addresses.AddressLine1 are valid too.

The same process of representing attributes as columns is repeated for each attribute of the entity type.

Analogically, there are analytical attributes for the analyticsAttributes section of the entity configuration. Sometimes there are also other types of attributes such as geoLocationAttributes. For each of them there is a separate column of type RECORD.

L3 to Reltio Analytics Table Mapping for Relations

Table name comes from the relation type URI. For example relation with "uri": "configuration/relationTypes/IndividualToOrganization" results in a table called, relation_IndividualToOrganization_source.

Each row of the table represents a single relation of that given type. The crosswalks column is present as each relation is created or modified by one or more crosswalks. For more details on the crosswalks column, refer to the L3 to Reltio Analytics Table Mapping for Entities section above.

Note that each relation table also contains the following columns:
  • createdTime - Timestamp of the relation creation in the source system.
  • updatedTime - Timestamp of the last update of the relation in the source system.
  • startDate - The date when the relation was created in the Reltio system.
  • endDate - The date when the relation was soft deleted (if deleted) in the Reltio system.
  • uri - The URI of the relation.
  • updatedBy - The ID of the person who made the last update.
  • type - The type of the relation; since one table holds information about one relation type it is expected that all the records in the given table have the same value in this column.
  • createdBy - The creator of the relation.
  • eventTime - The time at which the event was logged.
  • versionTime - The version of objects in the dataframe.
  • deleted - Boolean value indicating whether the relation is deleted or not.
  • deletedTime - Timestamp of the relation’s deletion in the source system.
  • id - The ID of the relation in the Reltio system.
  • start
    • Column with type RECORD (meaning object) that holds one property ID.
    • Holds the ID of the start object in the relation (start object configured in L3).
    • Relevant information can be accessed using the following syntax: Start.Id
  • end
    • Column with type RECORD (meaning object) that holds one property ID.
    • Holds the ID of the end object in the relation (end object configured in L3).
    • Relevant information can be accessed using the following syntax: End.Id
  • insertedTime - Timestamp of a relation inserted in the GBQ streaming table. For GBQ batch table, insertedTime is equal to the updatedTime.
Note: Each column can include an optional description. The description is a string and is trimmed to a maximum length of 1024 characters.
Relations in Reltio can also be configured to hold attributes. Tables sometimes contain the attributes column as well. For example, consider that a relation is configured in L3 as shown below.
{
  "uri": "configuration/relationTypes/IndividualToOrganization",
  "startObject": {
    "uri": "configuration/relationTypes/IndividualToOrganization/startObject",
    "directionalContext": [
      {
        "uri": "configuration/relationTypes/IndividualToOrganization/startObject/directionalContext"
      }
    ],
    "objectTypeURI": "configuration/entityTypes/Individual"
  },
  "endObject": {
    "uri": "configuration/relationTypes/IndividualToOrganization/endObject",
    "directionalContext": [
      {
        "uri": "configuration/relationTypes/IndividualToOrganization/endObject/directionalContext"
      }
    ],
    "objectTypeURI": "configuration/entityTypes/Organization"
  },
  "implicit": false,
  "attributes": [
    {
      "label": "Position",
      "name": "Position",
      "type": "String",
      "hidden": false,
      "important": false,
      "system": false,
      "attributeOrdering": {
        "orderType": "ASC",
        "orderingStrategy": "LUD"
      },
      "uri": "configuration/relationTypes/IndividualToOrganization/attributes/Position",
      "skipInDataAccess": false
    }
  ],
  "direction": "directed"
}

This results in a table called source_IndividualToOrganization_source that has all the described columns. In addition to this, the attributes column that is a RECORD with a single property position is repeated. For more information, see the How to Unnest Columns section below.

Reltio Analytics Matches & Merges Tables

Since match and merge is a crucial part of the Reltio system, the merges_source and matches_source tables in Reltio Analytics are provided out-of-the-box for each tenant.

After match and merge takes place in the Reltio Platform, a new record is created in the merges_source table. The structure of this table is as follows:

  • mergeKey - This is the same as the loserId.
  • winnerId - The entity ID of the winner record in the merge.
  • loserId - The entity ID of the loser record in the merge.
  • matchRules - List of the match rule URIs that played a role in matching these records before they got merged.
  • timestamp - The timestamp of the event that occurred during merging.
  • type - The type of event that occurred. MANUAL type for merging manually, AUTO type if the tenant is configured to merge these records without explicit approval, and SPLIT type for an unmerge.
  • mergeRulesUris - The URI of the merge rules. The configuration/entityTypes/Organization/matchGroups/FuzzyNameExactAddress URI corresponds to the match rule defined in your L3 under the Organization entity type.
  • insertedTime - Timestamp of match or merge being inserted in the GBQ streaming table. For a GBQ batch table, insertedTime is equal to the timestamp field.
The matches_source table contains data about matches found by the rules defined in the L3. The structure of the table is as follows:
  • matchKey - A combination of the sourceId and targetId separated by a column.
  • sourceId - The entity ID of the source entity.
  • targetId - The entity ID of the target entity.
  • matchRules - The match rule used for matching records before they are merged. For events generated after a potential match is removed, the matchRules column displays the name of the removed potential match rule. For example:

    Entity IDs, 5Q7KlGW and 5Q7KpWm were matched because of the PersonByExactFirstName and PersonByExactLastName rules. The PersonByExactFirstName rule was later removed from the configuration and as a result the POTENTIAL_MATCH_REMOVED event was generated. This is indicated as SPLIT in the type column.

  • timestamp - The timestamp when the match took place.
  • type - The type of match. For example, POTENTIAL_MATCH.
  • matchScore - The score of the match that is configured in the L3 matchRules in the matchGroups section of each entity.
  • insertedTime - Timestamp of match or merge being inserted in the GBQ streaming table. For a GBQ batch table, insertedTime is equal to the timestamp field.
Note:
  • For POTENTIAL_MATCH records, the following values are considered valid in the matches_source table:
    • timestamp = 0
    • insertedTime = 1970-01-01 00:00:00 UTC
  • Currently, Reltio Analytics does not resolve links for matches after entities are merged. Therefore, the GBQ's Matches view contains match records of loser entities even after they are merged.
Example Consider a match group configured as shown below.
{
    "uri": "configuration/entityTypes/Individual/matchGroups/Rule1",
    "label": "Rule1:Exact(email,firstName)",
    "type": "automatic",
    "useOvOnly": "true",
    "rule": {...},
    "matchServiceClass": "com.reltio.businesslogic.match.providers.internal.InternalMatchService",
    "scoreStandalone": 90,
    "scoreIncremental": 0
}
This match group result in making a match between two individual records that are then populated in Reltio Analytics. The configuration/entityTypes/Individual/matchGroups/Rule1 URI is inserted in the matchRules column and the matchScore is set accordingly.

Nested Columns

Reltio Analytics tables sometimes include nested columns, which are like arrays or list of values for each column. This type of column is recognized by its mode that is REPEATED. Some of the most common nested columns are as follows:
  • crosswalks
  • tags
  • Most of the attributes such as attributes.Name, attributes.Title, and so on
How to Unnest Columns?
Unnesting columns in Reltio Analytics is the process of flattening a list of values in a column. This is a fairly simple task, but has great influence on performance. Therefore you must perform this task with caution. The following is an example of unnesting:
SELECT
  cw AS Crosswalk
FROM `<dataset_id>.entity_Individual_source`,
UNNEST (crosswalks) AS cw
You are recommended to give an alias of your column as it helps you to avoid duplicate column error. You can also unnest multiple columns at once as shown below.
SELECT
  cw AS Crosswalk,
  tg AS Tag
FROM `<dataset_id>.entity_Individual_source`,
UNNEST (crosswalks) AS cw,
UNNEST (tags) as tg
You can also partially select sub-columns if your nested column is a STRUCT as shown below.
SELECT
  cw.Uri AS CrosswalkUri,
  cw.Source AS CrosswalkSource,
  tg AS Tag
FROM `<dataset_id>.entity_Individual_source`,
UNNEST (crosswalks) AS cw,
UNNEST (tags) as tg

Reltio Analytics Streaming

Since the Reltio system is very dynamic, we have an option to deliver data to Reltio Analytics tables in real-time to make the Reltio Analytics data and Reltio tenants consistent. Reltio Analytics streaming writes the data into the separate tables.

Dataset Views

Dataset with views named views_riq_dw_<env>_<tenant> contains views for selecting from both batch and streaming tables to get a whole set of data. For example, entities of type HCP contain the entity_HCP view.

We recommend you to use views instead of tables because the streaming table naming format might be changed in the future but views may have the same interface for names and fields.

Tenant Model Modification

Business requires tenant model change that affects attributes or the presence of the whole entity or relation types.

Table column attributes of type STRUCT represents the entity or relation attributes. If the tenant model has changed a set of attributes, then the Reltio Analytics table has an obsolete structure.

In order to have an up-to-date table, we suggest you to reach out to Reltio Support so they can perform a re-sync of your data. After re-syncing, the datasets are populated by new tables and views with the correct attribute structures.

Additional Considerations

Querying crosswalk data from Reltio Analytics for entities containing relations may seem incorrect as a non-existing crosswalk is returned. This issue comes from querying crosswalks with non-empty refEntityOrRelationId columns. To prevent such mismatches and to retrieve only the real crosswalks, add the following sample as a condition to your queries:

SELECT
  cw
FROM
  `customer-facing.views_riq_dw_geu_test_hAPWDOwXqmKhe3C.entity_Organization` T,
  unnest(T.crosswalks) as cw
WHERE
id = '1HfK1O15' and cw.refEntityOrRelationId = ''

Additional Resources

For more information about concepts, functions, functions and operators, statements and scripting, see Standard SQL Reference.