Comparator Classes

The comparator classes define the behavior of the comparison operators.

Your match rule will contain one or more high-level Comparison Operators (Exact, ExactOrNull, ExactOrAllNull, notExactSame, and Fuzzy) which operate on attributes you select for your rule. Notice that each has either the word Exact or Fuzzy in it. The actual behavior of the Exact or Fuzzy aspect of the Comparison Operators is governed by the comparator class you map to the attribute. For example, you may choose to use the ExactOrNull operator on a person’s suffix attribute. The OrNull part is easy to understand, but the type of comparison performed to determine whether the two suffix values are exactly the same depends on the comparator class. You might map the BasicStringComparator class, or perhaps the StringCharactersComparator class; each has its own way of determining exactness, and it's up to you to select the comparator class that makes the most sense for your data and your strategy.

List of Comparator Classes

A comparator class can be defined for each attribute used in exact, fuzzy, exactOrNull, and exactOrAllNull match rules. Currently, the Reltio platform does not support user-defined comparator classes. The following comparators are available:

Table 1. List of Comparator Classes
Comparator class: com.reltio.match.comparator.* Description
BasicStringComparator This comparator treats the attribute values as strings and returns true if the strings are identical. All characters are supported. It is a good starting point for a basic exact match use case.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings in a basic manner with no additional processing to remove special characters or need for fuzzy variations of the strings.
  • Typical use cases are Exact matching on First Name, Last Name, Middle Name, Product SKU, and so on.
  • Supports non-latin character sets.
  • Guidance regarding Match Token Class: ExactMatchToken class.
Note: If your rule does not define a comparator class, the match engine will use this comparator class and the ExactMatchToken class.
DamerauLevenshteinDistance Consider the values of two attributes as S1 and S2. This comparator counts n, the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2 most efficiently. The comparator returns true if n is:
  • =0 (that is, the strings are already equal)
  • <=1 where the largest raw string length is <=4
  • <= 2 where the largest raw string length is > 6 and <=10

For example, to make John equal to jon, n = 1; the comparator returns true.

For example, to make John equal to jonathon, n = 4 (four insertions); the comparator returns false.

  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for Exact, ExactOrNull, ExactOrAllNull, and notExactSame, the comparator’s logic is used for the Exact part of these.
  • Can be used for the Fuzzy Comparison Operator.
  • Recommended for cases where you wish to compare two strings that might have spelling inconsistencies.
  • Typical use cases are matching words that are believed to have spelling errors.
  • Supports non-latin character sets.
  • Guidance regarding Match Token Class: FuzzyTextAndNumberMatchToken class. If a match token class is not defined, the FuzzyTextAndNumberMatchToken class is used by default.
DynamicDamerauLevenshteinDistance Same as the DamerauLevenshteinDistance comparator, but this comparator supports a greater number of operations on longer strings. The comparator returns true if n is:
  • <= 1 where the largest raw string length is <=6
  • <= 2 where the largest raw string length is > 6 and <=10
  • <= 3 where the largest raw string length is > 10 and <= 20
  • <= 4 where the largest raw string length is > 20 and <=30
  • <= 5 where the largest raw string length is > 30
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that might have spelling mistakes.
  • Typical use cases are matching words that are believed to have spelling errors.
  • Supports non-latin character sets.
  • Guidance regarding Match Token Class: FuzzyTextAndNumberMatchToken class. If a match token class is not defined, the FuzzyTextAndNumberMatchToken class is used by default.
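The length-dependent thresholds listed above can be illustrated with a minimal sketch. It is not the platform's implementation: the function names are hypothetical, the distance counts only the three operations the description lists (insert, delete, replace), and case folding is an assumption made so that John and jon differ by one operation.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character inserts, deletes, and replaces."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # replace c1 with c2, or keep it
            ))
        previous = current
    return previous[-1]


def allowed_operations(longest: int) -> int:
    """Threshold table listed for DynamicDamerauLevenshteinDistance."""
    if longest <= 6:
        return 1
    if longest <= 10:
        return 2
    if longest <= 20:
        return 3
    if longest <= 30:
        return 4
    return 5


def dynamic_match(s1: str, s2: str) -> bool:
    # Assumption: values are compared case-insensitively.
    a, b = s1.lower(), s2.lower()
    return edit_distance(a, b) <= allowed_operations(max(len(a), len(b)))
```

With this sketch, dynamic_match("John", "jon") is true (one deletion, threshold 1), while dynamic_match("John", "jonathon") is false (four insertions against a threshold of 2 for a length of 8).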
MetaphoneComparator The comparator returns true if the two strings are phonetically equal based on the Metaphone algorithm. The Metaphone algorithm is thought to improve upon the Soundex algorithm because it takes into consideration various inconsistencies in the English spelling and pronunciation.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that are likely to sound the same when spoken even if they are spelled somewhat differently.
  • Typical use cases are matching words that are believed to sound the same.
  • Does not support non-latin character sets.
  • Guidance regarding Match Token Class: DictionaryStatsPhoneticFuzzyToken class. If a match token class is not defined, the DictionaryStatsPhoneticFuzzyToken class is used by default.
DoubleMetaphoneComparator The comparator returns true if the two strings are phonetically equal based on the Double Metaphone algorithm. The Double Metaphone algorithm is thought to be an improvement of the Metaphone algorithm.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that are likely to sound the same when spoken even if they are spelled somewhat differently.
  • Typical use cases are matching words that are believed to sound the same.
  • Does not support non-latin character sets.
  • Guidance regarding Match Token Class: DoubleMetaphoneMatchToken class. If a match token class is not defined, the DoubleMetaphoneMatchToken class is used by default.
SoundexComparator The comparator returns true if the two strings are phonetically equal based on the Soundex algorithm.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that are likely to sound the same when spoken even if they are spelled somewhat differently.
  • Typical use cases are matching words that are believed to sound the same.
  • Does not support non-latin character sets.
  • Guidance regarding Match Token Class: SoundexTextMatchToken class. If a match token class is not defined, the SoundexTextMatchToken class is used by default.
  • Additional Guidance: See other phonetic comparator options such as the Metaphone and Double Metaphone comparators.
StringCharactersComparator This comparator strips the two strings of all non-alphabetic characters (for example, /, @, #, $) and returns true if the two resulting strings are identical and neither is empty or null. If both strings are empty or null, the comparator returns false.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings in a basic manner but do so with special characters removed automatically.
  • Typical use cases are Exact matching on First Name, Last Name, Middle Name, Product SKU, and so on.
  • Does not support non-latin character sets.
  • Guidance regarding Match Token Class: FuzzyTextMatchToken class. If a match token class is not defined, the FuzzyTextMatchToken class is used by default.
StringComparatorIgnoringNulls The comparator returns true if the strings are identical AND both strings are non-zero in length, non-null, and not equal to the string ‘null’.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings in a basic manner but also test that both strings exist and have meaningful info.
  • Typical use cases are Exact matching on First Name, Last Name, Middle Name, Product SKU, and so on.
  • Supports non-latin character sets.
  • Guidance regarding Match Token Class: ExactMatchToken class. If a match token class is not defined, the ExactTextMatchToken class is used by default.
  • Additional guidance: Do not use with the ExactOrNull and ExactOrAllNull comparison operators. Both operators are designed to return true when a value is null, whereas the StringComparatorIgnoringNulls comparator class returns false in these cases, so this comparator is incompatible with them.
PhoneNumberComparator The comparator strips the strings of all non-numeric characters. The comparator is specifically expecting a result of 10 digits for the purpose of comparison. So if either of the resulting strings is less than 10 characters, the comparator returns false. Whereas if the resulting strings are >= 10 chars AND the right-most 10 chars are identical, the comparator returns true.

Examples:

  • (818)777-09876 and 818-777-0987 will produce 81877709876 and 8187770987 and return false
  • +0177(818)777-0987 and 818-7770987 will produce 01778187770987 and 8187770987, whose right-most 10 digits are both 8187770987, and return true
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare 10-digits of phone numbers in a basic manner.
  • Typical use cases are Exact matching on phone numbers.
  • Supports non-latin character sets.
  • Guidance regarding Match Token Class: PhoneNumberMatchToken class. If a match token class is not defined, the PhoneNumberMatchToken class is used by default.
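The comparison described for PhoneNumberComparator can be sketched as follows. This is an illustration of the stated rules only; the function name is hypothetical and the actual class may treat edge cases differently.

```python
import re


def phone_numbers_match(p1: str, p2: str) -> bool:
    # Strip every non-numeric character from both values.
    d1 = re.sub(r"\D", "", p1)
    d2 = re.sub(r"\D", "", p2)
    # Fewer than 10 digits on either side is never a match.
    if len(d1) < 10 or len(d2) < 10:
        return False
    # Compare only the right-most 10 digits.
    return d1[-10:] == d2[-10:]
```

Applied to the two examples above, the first pair returns False and the second pair returns True.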
OrganizationNamesComparator For each attribute being compared, the comparator parses the attribute’s string (for example, IBM Services Corp) into a collection of words. It then compares one collection to the other and if at least 60% of the words in the collections are the same, the comparator returns true.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of these.
  • Can be used for the Fuzzy Comparison Operator.
  • Recommended for cases where you wish to compare company names and you believe there will be inconsistencies due to each having a slightly different set of words.
  • Typical use cases are for matching on organization name.
  • Does not support non-latin character sets.
  • Guidance regarding Match Token Class: OrganizationNameMatchToken class. If a match token class is not defined, the OrganizationNameMatchToken class is used by default.
  • Additional guidance: This comparator does not remove garbage words.
AddressLineComparator

This comparator is used for exact matching on the street line of an address.

Algorithm:

  1. The two values to compare are transformed to lowercase, and only letters and numbers are retained. If both transformed values start with po-box, they are checked for equality. If they are the same, the comparator results in a match.
  2. If the values do not represent a po-box, the comparator normalizes them differently: the values are transformed to lowercase and only the letters are retained.
  3. The values are split into words.
  4. The comparator removes noise (garbage) words such as st, ave, rd, and so on.
    Note: If all the words are in the noise dictionary, then all the initial words are retained. For example, the value 123 ave contains digits and the noise word 'ave', so the value to compare is just ave.
  5. The two sets of words obtained by splitting are compared to find how many words of the first set (having M1 words) exist in the second set (having M2 words). The number of common words N is used to decide whether the initial values are the same: N / (M1 + M2 - N) >= 0.6. If the inequality holds, the comparator results in a match.
Note:
  • Displays same behavior for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If selected for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Does not support non-latin character sets.
  • Guidance regarding Match Token Class: AddressLineMatchToken class. If a match token class is not defined, the AddressLineMatchToken class is used by default.
  • Click Address Line Garbage to see the list of address line garbage words.
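The non-PO-box branch of the AddressLineComparator algorithm can be sketched roughly as below. The noise set is a tiny hypothetical stand-in for the real address line garbage dictionary, the function names are invented for illustration, and returning False for empty values is an assumption.

```python
import re

# Hypothetical stand-in for the address line noise dictionary.
NOISE_WORDS = {"st", "ave", "rd"}


def address_words(value: str) -> set[str]:
    # Lowercase the value and keep letters only, then split into words.
    words = re.findall(r"[a-z]+", value.lower())
    kept = [w for w in words if w not in NOISE_WORDS]
    # If every word is a noise word, the initial words are retained.
    return set(kept or words)


def address_lines_match(v1: str, v2: str, threshold: float = 0.6) -> bool:
    w1, w2 = address_words(v1), address_words(v2)
    if not w1 or not w2:
        return False  # assumption: empty values never match
    common = len(w1 & w2)
    # N / (M1 + M2 - N) >= 0.6
    return common / (len(w1) + len(w2) - common) >= threshold
```

For instance, address_words("123 ave") yields only ave, and "12 Red Linden St" matches "Red Linden Ave" because both reduce to the words red and linden.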
RangeNumericComparator Returns true if the difference in magnitude between two values falls within a specified range. The range can be expressed as an absolute value (for example, 5) or as a percentage (for example, 10%). For example, if the threshold is 5 and the two values are 12 and 16, the comparator returns true. If the threshold is 10% and the two values are 12 and 16, the comparator returns false. The percentage is applied to the smaller of the two values; thus in the case above, the threshold is calculated as 12 * 0.1 = 1.2.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare the difference between two numbers.
  • Typical use cases are for comparing product prices, age, height, and so on.
  • Supports non-latin character sets.
  • Guidance regarding Match Token Class: RangeNumericMatchToken class. If a match token class is not defined, the RangeNumericMatchToken class is used by default.
  • Additional guidance: If the RangeNumericMatchToken class is chosen in association with this comparator class, then all applicable parameters defined for the comparator class will also be used for the RangeNumericMatchToken class.

See example below showing the proper structure for setting required parameters in this class.

{
  "mapping": [
    {
      "attribute": "configuration/entityTypes/Household/attributes/Address/attributes/Zip5",
      "parameters": [
        {
          "parameter": "threshold",
          "value": "2"
        }
      ],
      "class": "com.reltio.match.comparator.RangeNumericComparator"
    }
  ]
}
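As a rough illustration of how an absolute or percentage threshold could be evaluated for RangeNumericComparator, consider the sketch below. The function name is hypothetical, and treating the boundary as inclusive is an assumption that the description does not confirm.

```python
def within_range(n1: float, n2: float, threshold: str) -> bool:
    diff = abs(n1 - n2)
    if threshold.endswith("%"):
        # A percentage is applied to the smaller of the two values.
        limit = min(n1, n2) * float(threshold[:-1]) / 100.0
    else:
        limit = float(threshold)
    # Assumption: a difference equal to the limit still counts as a match.
    return diff <= limit
```

With the values 12 and 16, a threshold of "5" gives True, while "10%" gives a limit of 1.2 and therefore False.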
BasicTokenizedOrganizationNameComparator Compares two organization names. Comparison is done in two steps:
  • Comparator normalizes each value. Value is split into separate words. All garbage words (inc, corp, services, and so on) are removed from the words list.
  • Comparator compares the words list. If there is at least 60% of the same words in the words list obtained in the first step, then the values are considered the same.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare company names and you believe there will be inconsistencies due to each having different garbage words.
  • Does not support non-latin character sets.
  • Guidance regarding Match Token Class: BasicTokenizedOrganizationNameMatchToken class. If a match token class is not defined, the BasicTokenizedOrganizationNameMatchToken class is used by default.
Note: Click Organization Names Garbage to see the list of organization name garbage words.
CustomComparator This comparator is a container for an unlimited number of groups of other comparators. Each group is configured by its own list of parameters, and the groups must be configured as a list of group elements. The following parameters are supported:
  • className - The name of the comparator class which is used to compare values. Default value - BasicStringComparator
  • classParams - Parameters (if any) for the comparator class.
  • pattern - Regular expression pattern which is used to match and extract group specific values from the original attribute values. If this parameter is not specified, the original values are used as group specific values.
  • noiseDictionary - Name of a predefined noise words dictionary or URL of a customer file with dictionary words. Noise dictionary contains words that are excluded from the attribute values before they are compared by the comparator. Possible values are:
    • addressLine
    • organizationName
    • internationalOrganizationName
    • eqfOrganizationName
    • foodOrganizationName
The following options control how the words are processed when enabled:
  • useNoiseIfEmpty - If enabled and the value contains noise words only, then no noise words are removed from the value. This indicates that tokens are generated for noise words as well. Default is true.
  • useStemmer - If enabled, the words are stemmed to their base form. Default is false.
  • useSoundex - If enabled, the words are replaced by their soundex codes. Default is false.
  • wordDelimeter - Delimiter which is used while concatenating the words into one value before passing the value to the provided comparator. Default is " " (white space).
  • sortWords - If enabled, words inside a group are sorted alphabetically before passing the value to the provided comparator. Otherwise, the original order of words is maintained.
The algorithm goes through configured groups, and for each group it does the following:
  1. Split the value into words according to the specified regexp pattern.
  2. Construct a list of lower-case words.
  3. Stem the words (if enabled).
  4. Remove noise words (if noise dictionary specified).
  5. Replace words with soundex codes (if enabled).
  6. Sort words (if enabled).
  7. Join the words into one value using the specified word delimiter and pass it to the provided comparator.

The total result is calculated as the expression: <part1_result> AND ... AND <partN_result>.
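A simplified sketch of the per-group processing is shown below. It covers only pattern extraction, lowercasing, optional sorting, joining, and the final AND across groups; stemming, noise removal, and soundex are left out, and plain string equality stands in for the default BasicStringComparator. The function names and the GROUPS structure are hypothetical.

```python
import re


def group_value(value: str, pattern: str, delimiter: str = " ",
                sort_words: bool = False) -> str:
    # Extract the group-specific words and lowercase them.
    words = [w.lower() for w in re.findall(pattern, value)]
    if sort_words:
        words.sort()
    # Join the words into the single value handed to the group's comparator.
    return delimiter.join(words)


def custom_match(v1: str, v2: str, groups: list[dict]) -> bool:
    # Every group must match: <part1_result> AND ... AND <partN_result>.
    return all(
        group_value(v1, g["pattern"]) == group_value(v2, g["pattern"])
        for g in groups
    )


# One group for alphabetic words and one for digits.
GROUPS = [{"pattern": r"[a-zA-Z]+"}, {"pattern": r"[\d]+"}]
```

Under these assumptions, "110, Street Red Linden" and "Street Red Linden 110" match, whereas "110, Street Red Linden" and "Street Red Linden" do not, because the digit group of the second value is empty.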

DistinctWordsComparator Compares values by distinct words based on the parameters specified. The following parameters are supported:
  • pattern - Regular expression pattern which is used to match and extract distinct words from the value. Default value is \\w+
  • threshold - The minimum number of words to be evaluated to consider the compared values as matches. The threshold value can be an absolute value or a percentage of words. Default value is 50%.
  • thresholdChars - The minimum number of characters to be evaluated to consider the compared values as matches. The thresholdChars value can be an absolute value or a percentage of characters.
  • noiseDictionary - The name of a predefined noise words dictionary or URL of a custom file with noise dictionary words. A noise dictionary contains words that are excluded from the attribute values before they are compared. Possible predefined values are:
    • addressLine
    • organizationName
    • internationalOrganizationName
    • eqfOrganizationName
    • foodOrganizationName
  • useNoiseIfEmpty - If enabled and the value contains noise words only, then no noise words are removed from the value. This indicates that tokens are generated for noise words as well. Default is true.
  • useStemmer - If enabled, the words are stemmed to their base form. Default is false.
  • useSoundex - If enabled, the words are replaced by their soundex codes. Default is false.
The algorithm is based on the following:
  1. Split the value into distinct words according to the specified regexp pattern.
  2. Construct a sorted set of lowercase words.
  3. Stem the words (if enabled).
  4. Remove noise words if noise dictionary is specified.
  5. Replace words with soundex codes (if enabled).
  6. Compare sets of words.

If the intersection is greater than the value of the threshold parameter, then the values are considered matches.

If comparison by words failed to match the values and the thresholdChars value is specified, then comparison is performed by character histograms. For both sets of words, the character histogram is calculated (characters are 0-9 and a-z). If the intersection is greater than the value of thresholdChars parameter, then the values are considered matches.

When the option thresholdChars is enabled, use ignoreInToken. If a match token class is not defined, DistinctWordsMatchToken is used automatically.

ExactMultiComparator
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used.
CrossMultiComparator
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of these.
  • Can be used for the Fuzzy Comparison Operator.
  • Recommended for cases where you wish to compare two or more attributes which could be mixed while filling the values.
  • Guidance regarding Match Token Class: CrossMultiToken class. If a match token class is not defined, the CrossMultiToken class is used by default.
ProximateGeoComparator
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of these.
  • Recommended for cases where you wish to define a maximum distance between two locations that is considered to be semantically the same as if the two locations shared the same location.
  • Typical use cases are for comparing the longitude and latitude of two objects where there might be some data quality problems regarding the longitude and latitude of the objects. Thus you might use this capability to declare that if the distance between the two objects is < 500 feet, the two locations are considered the same.
  • Guidance regarding Match Token Class: ProximateGeoToken class. If a match token class is not defined, the ProximateGeoToken class is used by default.
ExactOrNullCrossMultiComparator
  • Recommended for cases where you want to compare two or more attributes that could be mixed while filling the values and where one of the attributes can be null.
  • The match operand with the comparator does not produce the match tokens. As a result, the rule must have an additional operand that produces the match tokens and initiates match comparison.

Example of ProximateGeoComparator

"comparatorClasses": {
 "mapping": [
   {
     "attribute": "configuration/entityTypes/Location/attributes/AddressLine1",
     "class": "com.reltio.match.comparator.AddressLineComparator"
   },
   {
     "attribute": "configuration/entityTypes/Location/attributes/LatLong",
     "parameters": [
       {
         "parameter": "distance_miles",
         "value": "0.2"
       }
     ],
     "class": "com.reltio.match.comparator.ProximateGeoComparator"
   }
 ]
},
"multi": [
 {
   "uri": "configuration/entityTypes/Location/attributes/LatLong",
   "attributes": [
     "configuration/entityTypes/Location/attributes/GeoLocation/attributes/Latitude",
     "configuration/entityTypes/Location/attributes/GeoLocation/attributes/Longitude"
   ]
 }
]

ProfileA Latitude, Longitude ["59.939782", "30.314548"]
ProfileB Latitude, Longitude ["59.938524", "30.315995"]
Result -> True, distance between 2 geo points is ~0.1 mile

ProfileA Latitude, Longitude ["59.939782", "30.314548"]
ProfileC Latitude, Longitude ["59.936433", "30.317226"]
Result -> False, distance between 2 geo points is ~0.25 mile
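The distances quoted above can be reproduced with a great-circle calculation. The sketch below uses the haversine formula and a mean Earth radius of 3958.8 miles; whether ProximateGeoComparator uses this exact formula is an assumption, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def distance_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def geo_match(p1: tuple, p2: tuple, threshold_miles: float) -> bool:
    # p1 and p2 are (latitude, longitude) pairs in decimal degrees.
    return distance_miles(*p1, *p2) <= threshold_miles


profile_a = (59.939782, 30.314548)
profile_b = (59.938524, 30.315995)
profile_c = (59.936433, 30.317226)
```

With a distance_miles threshold of 0.2, profile_a and profile_b fall inside the limit and profile_a and profile_c fall outside it, which agrees with the results listed above.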

Example of CrossMultiComparator

"multi": [
        {
          "uri" : "configuration/entityTypes/HCP/attributes/MultiGroup1",
          "attributes" : [
            "configuration/entityTypes/HCP/attributes/FirstName",
            "configuration/entityTypes/HCP/attributes/LastName"
          ]
        }
      ],
      "comparatorClasses": {
        "mapping": [
          {
            "attribute": "configuration/entityTypes/HCP/attributes/MultiGroup1",
            "class": "com.reltio.match.comparator.CrossMultiComparator"
          },
          {
            "attribute": "configuration/entityTypes/HCP/attributes/FirstName",
            "class": "com.reltio.match.comparator.BasicStringComparator"
          },
          {
            "attribute": "configuration/entityTypes/HCP/attributes/LastName",
            "class": "com.reltio.match.comparator.BasicStringComparator"
          }
        ]
      }

ProfileA FirstName, LastName [“John”, “Doe”]
ProfileB FirstName, LastName [“Doe”, “John”]

Example of ExactOrNullCrossMultiComparator

{
  "rule": {
    "exact": [
      "configuration/entityTypes/HCP/attributes/MiddleName"
    ],
    "multi": [
      {
        "uri": "configuration/entityTypes/HCP/attributes/MultiGroup1",
        "attributes": [
          "configuration/entityTypes/HCP/attributes/FirstName",
          "configuration/entityTypes/HCP/attributes/LastName"
        ]
      }
    ],
    "comparatorClasses": {
      "mapping": [
        {
          "attribute": "configuration/entityTypes/HCP/attributes/MultiGroup1",
          "class": "com.reltio.match.comparator.ExactOrNullCrossMultiComparator"
        }
      ]
    }
  }
}


ProfileA FirstName, LastName, MiddleName["John", null, "Bob"]
ProfileB FirstName, LastName, MiddleName["Doe", "John", "Bob"]

Example of CustomComparator

{
 "attribute": "configuration/entityTypes/Location/attributes/AddressLine1",
 "parameters": [
   {
     "parameter": "groups",
     "values": [
       {
         "pattern": "[a-zA-Z]+"
       },
       {
         "pattern": "[\\d]+"
       }

     ]
   }
 ],
 "class": "com.reltio.match.comparator.CustomComparator"
}

ProfileA AddressLine1 ["110, Street Red Linden"]
ProfileB AddressLine1 ["Street Red Linden 110"]
Result -> True, after applying the patterns both strings will be split into 2 groups: [“Street Red Linden”, “110”]. These values will be compared in pairs with BasicStringComparator.

ProfileA AddressLine1 ["110, Street Red Linden"]
ProfileC AddressLine1 ["Street Red Linden"]
Result -> False, after applying the patterns ProfileA will be split into 2 groups: [“Street Red Linden”, “110”], ProfileC into [“Street Red Linden”, “”].

{
 "attribute": "configuration/entityTypes/Location/attributes/AddressLine1",
 "parameters": [
   {
     "parameter": "groups",
     "values": [
       {
         "pattern": "[a-zA-Z]+",
         "noiseDictionary": "addressLine",
         "className": "com.reltio.match.comparator.SoundexComparator",
         "useNoiseIfEmpty": "true"
       },
       {
         "pattern": "[\\d]+"
       }

     ]
   }
 ],
 "class": "com.reltio.match.comparator.CustomComparator"
}

ProfileA AddressLine1 ["24 Linden Drive"]
ProfileB AddressLine1 ["24 Lynden Beach Dr"]
Result -> True, after applying the patterns and removing noise words ProfileA will be split into 2 groups: [“Linden”, “24”], ProfileB into [“Lynden”, “24”]. “Lynden” and “Linden” have the same soundex code.

Relevance Calculation for Different Comparators

A comparator class is used by Reltio to compare attribute values.

Each comparator class describes a comparison algorithm and a relevance score calculation algorithm. Every comparator class has two major responsibilities:
  • A comparator class compares strings to find out whether they can be considered a match.
  • A comparator class helps to calculate how close the strings are (relevance score).

The following table shows how to perform Relevance calculation for different Comparators:

The relevance value ranges from 0 to 1. The closer the relevance is to 1, the more closely the compared values match.

Consider the values of two attributes as S1 and S2 for calculating relevance score.

Table 2. Relevance Calculation for Different Comparators
Comparator Class Description
AddressLineComparator

If addressLines S1 and S2 start with PO-box number:

  • If PO-box values are equal then the relevance is 1.

  • If PO-box values are different, then the relevance is 0.

If addressLines do not start with a PO-box value, the calculation of the relevance includes the following steps:

  • Split S1 and S2 by words.

  • Remove all noise words. If the line consists of noise words only, this step is skipped.

  • The relevance is the ratio of the common words count to the count of all unique words in the compared lines:

Ncommon / (S1WordsCount + S2WordsCount - Ncommon)

AlwaysTrueComparator Relevance is 1.
BasicStringComparator If strings are equal then the relevance is 1. Otherwise the relevance is 0.
BasicTokenizedOrganizationNameComparator

The relevance calculation includes the following steps:

  1. Split the lines by words.
  2. Remove all noise words (from the organizationName noise dictionary). If the line consists of noise words only, this step is skipped.
  3. The relevance is the ratio of the common words count to the count of all unique words in the compared lines:

Ncommon / (S1wordsCount + S2wordsCount - Ncommon)

CrossMultiComparator

Relevance is calculated for each combination of attributes. The biggest value is picked as a result.

For each combination of attributes, the relevance is the sum of the relevances of each attribute (Rattribute) divided by the count of attributes (Nattributes):

(Rattribute1 + Rattribute2 + ... + RattributeN) / Nattributes
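One way to read the CrossMultiComparator relevance is sketched below: every ordering of the second profile's values is tried against the first profile's values, the per-attribute relevances are averaged, and the largest average wins. Interpreting "each combination" as permutations is an assumption, the per-attribute relevance is the 1 or 0 of BasicStringComparator, and the function names are hypothetical.

```python
from itertools import permutations


def basic_relevance(a: str, b: str) -> float:
    # BasicStringComparator relevance: 1 if equal, otherwise 0.
    return 1.0 if a == b else 0.0


def cross_multi_relevance(values1: list[str], values2: list[str]) -> float:
    # Assumes both lists are non-empty and have the same length.
    best = 0.0
    for candidate in permutations(values2):
        total = sum(basic_relevance(a, b) for a, b in zip(values1, candidate))
        best = max(best, total / len(values1))
    return best
```

For the profiles ["John", "Doe"] and ["Doe", "John"], the swapped ordering lines up both values, so the sketch returns 1.0.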

DamerauLevenshteinDistance

This comparator counts the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2 most efficiently.

The relevance is calculated as 1 - distance / max(S1_length, S2_length).

The smaller the distance, the closer the relevance is to 1.

DistinctWordsComparator

The algorithm is based on the following:

  1. Split the value into distinct words according to the specified regexp pattern.
  2. Construct a sorted set of lowercase words.
  3. Apply stemmer to the words (if enabled).
  4. Remove noise words if noise dictionary is specified.
  5. Replace words with soundex codes (if enabled).
  6. Compare sets of words.

The relevance is the ratio of the common words count Ncommon to the larger of the two word counts:

Ncommon / max(S1wordsCount, S2wordsCount)

Additional Case: If the thresholdChars value is specified, then relevance calculation is performed by character histograms. For both sets of words, the character histogram is calculated (characters are 0-9 and a-z).

The relevance is the ratio of the common chars count Ncommon to the larger of the two chars counts:

Ncommon / max(S1charsCount, S2charsCount)

In this case relevance will be the bigger one between relevance by words and relevance by chars.
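The two DistinctWordsComparator relevance formulas can be sketched as follows. Stemming, noise removal, and soundex are omitted, the function names are hypothetical, and counting common characters as the overlap of the two histograms (the sum of the smaller count per character) is an assumption.

```python
import re
from collections import Counter


def distinct_words(value: str, pattern: str = r"\w+") -> set[str]:
    # Split into distinct lowercase words using the regexp pattern.
    return {w.lower() for w in re.findall(pattern, value)}


def words_relevance(w1: set[str], w2: set[str]) -> float:
    largest = max(len(w1), len(w2))
    return len(w1 & w2) / largest if largest else 0.0


def char_histogram(words: set[str]) -> Counter:
    # Count only the characters 0-9 and a-z.
    return Counter(c for w in words for c in w if c.isascii() and c.isalnum())


def chars_relevance(w1: set[str], w2: set[str]) -> float:
    h1, h2 = char_histogram(w1), char_histogram(w2)
    largest = max(sum(h1.values()), sum(h2.values()))
    common = sum((h1 & h2).values())  # per-character minimum of both counts
    return common / largest if largest else 0.0


def relevance(v1: str, v2: str, use_chars: bool = False) -> float:
    w1, w2 = distinct_words(v1), distinct_words(v2)
    by_words = words_relevance(w1, w2)
    if not use_chars:
        return by_words
    # With thresholdChars specified, the bigger of the two relevances is used.
    return max(by_words, chars_relevance(w1, w2))
```

For "Acme Corp" and "Acmee Corp", the relevance by words is 1/2, the relevance by characters is 8/9, and the sketch returns the larger value when use_chars is set.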

DoubleMetaphoneComparator

The comparator returns 1 if the two strings are phonetically equal based on the Double Metaphone algorithm, otherwise the relevance is 0.

The Double Metaphone algorithm is thought to be an improvement of the Metaphone algorithm.

DynamicDamerauLevenshteinDistance

This comparator counts the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2 most efficiently.

The relevance is calculated as 1 - sqr(distance) / sqr(max(S1_length, S2_length)).

The smaller the distance, the closer the relevance is to 1.

ExactMultiComparator Returns 1 if all compared attributes are equal, otherwise returns 0.
MetaphoneComparator The comparator returns 1 if the two strings are phonetically equal based on the Metaphone algorithm, otherwise the relevance is 0.
OrganizationNamesComparator

The relevance calculation includes the following steps:

  1. Split the lines by words.
  2. The relevance is a ratio of common words count to count of all unique words in compared lines

commonWordsCount / (S1_wordsCount + S2_wordsCount - commonWordsCount)
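A short sketch of this ratio for OrganizationNamesComparator is given below. Lowercasing the words and splitting on whitespace are assumptions, and the function name is hypothetical.

```python
def organization_relevance(s1: str, s2: str) -> float:
    # Split each name into a set of words.
    w1 = set(s1.lower().split())
    w2 = set(s2.lower().split())
    common = len(w1 & w2)
    unique = len(w1) + len(w2) - common
    # commonWordsCount / (S1_wordsCount + S2_wordsCount - commonWordsCount)
    return common / unique if unique else 0.0
```

For "IBM Services Corp" and "IBM Corp", two of the three unique words are shared, giving a relevance of about 0.67.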

PhoneNumberComparator The comparator strips the strings of all non-numeric characters. The comparator specifically expects a result of 10 digits for comparison purposes. Therefore, if either of the resulting strings is less than 10 characters, the comparator will return 0; else if the resulting strings are >= 10 chars AND the right-most 10 chars are identical, the comparator returns 1.
ProximateGeoComparator

The relevance is

1 - (distance between the 2 compared points) / distance_threshold

The closer the compared geo-points are to each other, the closer the relevance is to 1.

RangeNumericComparator

Consider two attribute values (numbers) as N1 and N2.

The relevance is 1 - (difference between the two numbers) / max(N1, N2)

For example, if the two values are 12 and 16, the relevance is 1 - 4/16 = 0.75.

SoundexComparator The comparator returns 1 if the two strings are phonetically equal based on the Soundex algorithm, otherwise returns 0.
StringCharactersComparator This comparator strips the two strings of all non-alphabetic characters (for example, /, @, #, $) and returns 1 if the two resulting strings are identical and neither is empty or null; otherwise it returns 0.
StringComparatorIgnoringNulls The comparator returns 1 if the strings are identical AND are not equal to the string null; otherwise it returns 0.