Reltio Data Science FAQ

Find answers to Frequently Asked Questions (FAQs) about Reltio Data Science.

  1. How is the initial data load handled?

    The Data Sync job API migrates data from the Platform to the Reltio Data Science Storage. It must be triggered once after enabling Reltio Data Science, or whenever a full data sync from the Platform to the Reltio Data Science Storage is needed.

  2. How is the cluster size calculated and managed?

    Cluster size is automatically calculated based on different parameters of your tenant. Clusters are automatically bootstrapped and terminated after job execution.

  3. How are updates handled?

    Updates are streamed in near real time to the Reltio Data Science S3 storage. Within a couple of minutes, the data is available for use with the Reltio Data Science DataAccess API.

  4. How are schema changes handled?

    Schema changes are reflected automatically in Reltio Data Science when you log in to the AnalyticsFramework in Spark Notebook. If you are already logged in to the AnalyticsFramework in Spark Notebook, you must log out and log in again to pick up the new schema.

  5. What are the default parameters for the jobs?

    For information about default parameters for jobs, see Data Export Task.

  6. Is there a particular order in which manual Data Sync jobs must be run?

    Data Sync jobs can be run in any order.

  7. What are the roles needed to run Data Sync jobs?

    To run Data Sync jobs, you must have the ROLE_ANALYTICS or ROLE_TENANT_ADMIN role.

  8. How do I cancel jobs?

    For information about canceling jobs, see Cancel Job Execution.

  9. How can I use my own custom Protected AWS S3 bucket with a Reltio-provided Qubole account?

    Any S3 bucket can be accessed from the Reltio-provided Qubole account, as long as you have permission to access that bucket.

  10. How can I use a custom bootstrap with a Reltio-provided Qubole account?

    Perform the following steps to use a custom bootstrap:

    1. Log in to Qubole.
    2. Navigate to Cluster > Cluster Node bootstrap.
    3. Update the shell script with relevant changes.

  11. How often are analytical attributes updated?

    Analytical attributes can be updated any number of times by running DataPersist jobs on request, but it is recommended not to update them more than once a day.

  12. Can analytical attributes be configured to be updated on a set schedule?

    Yes. The DataPersist job can be scheduled to update analytical attributes on a set schedule.

  13. What is the recommended way to update analytical attributes: run scheduled batches to compute all of the attributes at once and update them in Reltio, or use Spark Streaming to compute them as updated data comes in?

    It is recommended to schedule the DataPersist job to run in batches to update analytical attributes.

  14. Is there a limit on the number of analytical attributes?

    It is recommended to limit the number of attributes to fewer than 200 for better performance.

  15. How is the data synced from the Reltio platform to Qubole?

    The initial one-time load syncs the full data set. Subsequently, only delta changes are synced in near real time.

  16. What can I do in case of inconsistencies?

    If you come across inconsistencies, perform a full data sync.

  17. Who manages the administration of the Qubole environment?

    The Qubole environment is managed by the Reltio Manufacturing team.

  18. Can I run Hive queries on Reltio data in Qubole?

    Hive queries on Reltio data in Qubole are currently not supported.

  19. Do customers get a separate Qubole environment for each of their tenants (For example: prod, dev, test, etc.)?

    Reltio works with one Qubole environment for all tenants.

  20. What is the best way to verify that the data in Qubole matches the data in the tenant?

    The best way is to compare the attributes from the L3 configuration with the Qubole DataFrame schema. For more information about the L3 configuration, see GET Configuration.
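
    For the DataFrame side of the comparison, you can print the schema from the notebook. A minimal sketch, assuming the AnalyticsFramework instance af is available after login (as in the example in question 40); the entity type is illustrative:

    val contactDF = af.entities("configuration/entityTypes/Contact")
    // Compare this printed schema with the attributes in the L3 configuration.
    contactDF.printSchema()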

  21. How can I verify if Reltio Data Science is enabled for my tenant?

    To verify if Reltio Data Science is enabled for your tenant, run the GET {{riUrl}}/api/v1.0/access/{{tenantID}}/enabled request.
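
    A minimal sketch of making this call from Scala, assuming placeholder values for {{riUrl}} and {{tenantID}} and that the request is authorized with a Reltio access token in a standard Authorization header (an assumption; use whatever authentication your environment requires):

    import scala.io.Source

    val riUrl = "https://your-ri-url"          // hypothetical placeholder for {{riUrl}}
    val tenantID = "yourTenantId"              // hypothetical placeholder for {{tenantID}}
    val token = sys.env("RELTIO_ACCESS_TOKEN") // assumes a token is available in the environment

    val conn = new java.net.URL(s"$riUrl/api/v1.0/access/$tenantID/enabled")
      .openConnection().asInstanceOf[java.net.HttpURLConnection]
    conn.setRequestProperty("Authorization", s"Bearer $token")
    println(Source.fromInputStream(conn.getInputStream).mkString)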

  22. How can I view my Reltio Insights for data science usage?

    Reltio Reporting describes Reltio Insights usage for data science as well as data analytics.

  23. What are the optimizations performed by Reltio on S3 storage? "Perform optimization of S3 storage for performance" means a Compaction job. How much can we benefit from this? This seems similar to a vacuum command in Amazon Redshift.

    Over time, the S3 storage accumulates updates streamed from Reltio. These updates are applied at data access time, so letting them accumulate over a period becomes expensive, which is why the Compaction job exists. As an experiment, we changed every object in the database, which caused updates to be sent to S3; after compaction, data access performance improved by three times.

  24. When it says, 'it creates data frame and query data frame', is a table supported?

    Both DataFrames and tables are supported; the two terms are sometimes used interchangeably.

  25. "Load data back to Reltio". What does it mean by loading data into Reltio? Does it mean Cassandra?

    Yes, it means Cassandra. Reltio uses Cassandra as the primary datastore and Elasticsearch as the search engine. "Load data back to Reltio" means that Data Persist sends updates to both Cassandra and the Elasticsearch index.

  26. What is Reltio Data Science Spark SDK? How is it exposed as Spark libraries?

    Reltio Data Science Spark SDK is a Scala library that provides classes to access Reltio data in Spark and to persist data from Spark back to Reltio. The library must be attached to the cluster.
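
    A minimal sketch of using the SDK from a notebook, assuming the library is attached to the cluster and an AnalyticsFramework instance af is available after login (the entity-type URI follows the pattern shown in question 40):

    import org.apache.spark.sql.DataFrame

    // Read a Reltio entity type into a Spark DataFrame via the SDK.
    val contactDF: DataFrame = af.entities("configuration/entityTypes/Contact")
    contactDF.show(10)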

  27. Why would I use the Reltio Data Science SDK instead of Qubole Data Service (QDS)? What are the benefits?

    There is no analog of the Reltio Data Science SDK. You must use it to access any of the data stored in Reltio.

  28. It has been mentioned that data needs to be exported to S3 for analytic processing (Data Export Job). Can Reltio process data directly from Cassandra?

    For security and performance reasons, direct access to Cassandra from Spark is not provided.

  29. When Reltio says a job, is it a Spark job that uses Spark processing power? If yes, why would I use the Reltio API and not the Spark API?

    Reltio Data Science jobs are launched in the Reltio Spark cluster. This cluster is private and has a direct connection to the main database, so it is accessible only by Reltio staff and is never exposed outside the VPC. These jobs are bundled as part of the Reltio Spark Connector package. Using the Reltio API provides an added layer of security.

  30. What are the Reltio objects-tables in Cassandra?

    Entities, relations, and interactions defined in Reltio tenant metadata configuration are Reltio objects.

  31. Why would I use Data Access API of Reltio instead of Spark API/QDS API? What are the benefits?

    To access data managed by Reltio, you must use Reltio Data Science Data Access. To access flat files directly from your own S3 bucket, use Spark's native data sources API to read and write data.
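
    A minimal sketch of the two access paths side by side, assuming af as above and a SparkSession spark; the bucket and path are hypothetical:

    // Reltio-managed data: go through the Data Access API.
    val contacts = af.entities("configuration/entityTypes/Contact")

    // Your own flat files in your own S3 bucket: use Spark's data sources API.
    val myData = spark.read.parquet("s3://my-protected-bucket/path/to/data")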

  32. Why do I need to create JAR files from my Scala script? Can I just execute scripts directly?

    You can, but it is better practice to package your scripts into JAR files, since that makes them easier to manage, test, and distribute. QDS provides integration with SBT to automatically build, deploy, and attach libraries to clusters. Executing scripts directly may not leverage the benefits of Reltio data management.
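
    A minimal build.sbt sketch for packaging such scripts into a JAR with SBT; the project name and versions are illustrative, not Reltio-prescribed:

    // build.sbt -- run `sbt package` to produce a JAR you can attach to the cluster.
    name := "my-analytics-jobs"
    version := "0.1.0"
    scalaVersion := "2.11.12"
    // Spark is provided by the cluster, so mark the dependency "provided".
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8" % "provided"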

  33. Does Reltio's Data Persist API store data in memory, Redshift, or S3? When reading data, which storage persists the data: Cassandra or Reltio-owned S3?

    Reltio can only persist data back to the Reltio primary storage (Cassandra) and add this data to the ES index. To persist data in Redshift, you must write custom code or use the open source Spark-Redshift connector. For more information, see https://github.com/databricks/spark-redshift.

  34. Regarding the Persist API's automatic index, does it index data in files? Based on ID columns? Can I index on composite columns?

    This is a full-text index in Elasticsearch.

  35. The Persist API needs an ID column to persist data. This is understandable for a dimension table, but what about a cross-reference table?

    You can only persist dimension data, that is, entities and relations.

  36. Which Audit fields are available for loading from Reltio?

    The following system fields are exposed via DataFrames on entities, relations, and interactions: createdBy, createdTime, updatedBy, and updatedTime. For data created before this feature was introduced, you must re-export the data using a Data Export job to expose these fields.
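
    A minimal sketch of reading these fields, assuming af as above; the entity type is illustrative:

    val df = af.entities("configuration/entityTypes/Contact")
    // The audit fields appear as regular DataFrame columns.
    df.select("createdBy", "createdTime", "updatedBy", "updatedTime").show(5)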

  37. For entities we can extract crosswalk, can we do the same for relations?

    Crosswalks are currently available for entities. Support for relations is planned for upcoming releases.

  38. How can we update interactions?

    Interactions can be updated by uploading a new batch with the same crosswalks, using the Reltio Data Science Jobs API.

  39. How does Data Access handle non-supported field types (for example, explode for BLOB fields)?

    Field types that are not supported are not processed. For BLOB fields, Data Access logs a warning message like:

    {Field} is of type BLOB, which is currently not supported, and is omitted from processing.

  40. When extracting a DataFrame using the Reltio Data Science API with Incremental Data Access and there are no data updates, the call returns an empty DataFrame instead of the expected exception. Example of the DataFrame call:

    val contactDF: DataFrame = af.entities("configuration/entityTypes/Contact")

    Expected exception:

    java.lang.IllegalArgumentException:
        Data has never been imported into the system:
        [s3n://reltio.tenant.test.z2w-s-gt-o-lku-x-y8-t9/]
        [s3n://reltio.tenant.test.z2w-s-gt-o-lku-x-y8-t9/entityTypes/Contact]

    The reason is that the DataFrame contained only events for deleted entities or delayed events. The exception is thrown only when there are no updates at all; in this case, updates existed but were excluded from the DataFrame by design. While we investigate ways to address this situation, the workaround for Reltio Data Science users is to catch the exception in the code and check whether the returned DataFrame is empty, as in the sketch below.
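
    A minimal sketch of that workaround, assuming af as above; the entity type is illustrative:

    import org.apache.spark.sql.DataFrame
    import scala.util.{Try, Success, Failure}

    val result: Try[DataFrame] = Try(af.entities("configuration/entityTypes/Contact"))
    result match {
      case Success(df) if df.head(1).isEmpty => println("No updates in this increment")
      case Success(df)                       => df.show(10)
      case Failure(e: IllegalArgumentException) =>
        // Thrown when data has never been imported for this type.
        println(s"No data imported yet: ${e.getMessage}")
      case Failure(e) => throw e
    }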

  41. How are incompatible values handled in the field type (For example: string value in Int type)?

    When a field has a value that is incompatible with the field's type, the value is treated as null in the DataFrame column.

  42. How are field types unknown to Spark handled?

    Field types unknown to Spark are converted to String type.

  43. Does Reltio Data Science take into account user's privileges to access Reltio objects?

    When a user defines a dataset, Reltio Data Science leverages metadata security to show only the objects that the user has access to.