Match Token Classes

Match token classes generate the match tokens that identify candidates for matching.

The following table explains the available match token classes:

Table 1. Match Token Generators
Match token class Description Potential number of tokens to generate
ExactMatchToken This class generates a single match token that uses the lowercase version of the attribute value as the token. Only letters, digits, and spaces are used in the token; other characters are removed. For example, if the attribute contains John, then the token will be john. If the attribute contains John B, then the token will be john b.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of the Exact part in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, and notExactSame.
  • Intended for use with comparator class: BasicStringComparator.
  • Supports non-latin character sets.
1
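As a rough illustration of the ExactMatchToken normalization described above, the following Python sketch lowercases a value and keeps only letters, digits, and spaces. The function name exact_match_token is hypothetical and this is not the product implementation.

```python
def exact_match_token(value: str) -> str:
    """Lowercase the value and keep only letters, digits, and spaces.

    Hypothetical helper: a sketch of the documented behavior only.
    """
    return "".join(ch for ch in value.lower() if ch.isalnum() or ch == " ")


print(exact_match_token("John"))
print(exact_match_token("John B"))
print(exact_match_token("O'Neil-Smith"))
```

Because str.isalnum() also accepts non-Latin letters, the sketch stays in line with the statement that this class supports non-latin character sets.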
ExactNumberMatchToken This class generates a match token that represents the numeric characters from the string. For example, if the attribute value is ACL89786291D, the token will be 89786291.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of the Exact part in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, and notExactSame.
  • Intended for use with comparator class: BasicStringComparator.
  • Supports non-latin character sets.
1
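The digit extraction performed by ExactNumberMatchToken can be pictured with a one-line Python sketch. The function name is hypothetical, and how the real class handles values with no digits is not covered here.

```python
def exact_number_match_token(value: str) -> str:
    """Keep only the numeric characters of the value, in their original order.

    Hypothetical helper: a sketch of the documented behavior only.
    """
    return "".join(ch for ch in value if ch.isdigit())


print(exact_number_match_token("ACL89786291D"))
```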
SoundexTextMatchToken This class generates a single match token that represents the phonetic representation of the value based on the Soundex algorithm.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of the Exact part in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, notExactSame.
  • Intended for use with comparator class: SoundexComparator.
  • Does not support non-latin character sets.
1
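For readers unfamiliar with Soundex, the sketch below implements the commonly described American Soundex rules in Python. It is an assumption that SoundexTextMatchToken follows this exact variant; the function name and the four-character code length are illustrative only.

```python
# Consonant groups of classic American Soundex; vowels, h, w, and y have no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word: str) -> str:
    """Return a four-character Soundex code, or an empty string if there are no letters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike, such as Robert and Rupert, produce the same code and would therefore share a token under this variant.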
DoubleMetaphoneMatchToken This class is based on the Double Metaphone algorithm, which can generate two match tokens for a string: a primary code and a secondary code. The secondary token covers ambiguous cases such as character-ordering and double-character misspellings, for example John and Jhon.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of the Exact part in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, and notExactSame.
  • Intended for use with comparator class: DoubleMetaphoneComparator.
  • Does not support non-latin character sets.
2
FuzzyTextMatchToken This class generates a token for each common spelling up to a maximum of six tokens. For example, 'Michael' can be misspelled as Michale, Michel, Micheal, and so on. In this case, one token each is generated for Michale, Michel, and Micheal.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of Exact in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, notExactSame.
  • Intended for use with comparator class: StringCharactersComparator.
  • Does not support non-latin character sets.
6
DictionaryStatsPhoneticFuzzyToken This class generates a metaphone token (only when the Fuzzy operator is used) and a token based on character frequency. It can catch misspellings and adjacent duplicated characters.
  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame): the metaphone token is generated only for the Fuzzy operator.
  • If chosen, implements the behavior of the Exact part in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, notExactSame.
  • Intended for use with comparator class: MetaphoneComparator. However, in general, values with the same frequency-based tokens but different metaphone tokens will not result in a match. If two values must be compared according to letter statistics, it is better to use the DistinctWordsComparator class with the thresholdChars parameter.
  • Does not support non-latin character sets.
2
ComplexPhoneticNameToken This class works in the same way as the DictionaryStatsPhoneticFuzzyToken class but selects only letters (the DictionaryStatsPhoneticFuzzyToken class selects letters and digits). The class is applicable to any values, not only names.

Examples

Character ordering: john, jhon. Tokens: [JN, A+IAKK8-4fth-48], [JHN, A+IAKK8-4fth-48]. The two values share the statistical token.

Double characters: john, johhn. Tokens: [JN, A+IAKK8-4fth-48], [JN, A+IAKK8-4fth-48]. The two values share both the statistical and the metaphone tokens.

  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of the Exact part in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, notExactSame.
  • Intended for use with comparator class: MetaphoneComparator. However, in general the values with the same frequency-based tokens but different metaphone tokens will not result in a match. If it is required to compare two values according to letter statistics, then it is better to use the DistinctWordsComparator class with the thresholdChars parameter.
  • Does not support non-latin character sets.
2
AddressLineMatchToken This class generates a token for the AddressLine1 attribute as follows:
  1. Lowercases the attribute value, removes all characters that are not letters or digits, and checks whether the value starts with po-box. If it does, the resulting token is the concatenation of the input words (for example, PO BOX 100 Marine Pkwy #275 -> po-box-100-marine-pkwy-275).
  2. Splits the attribute value into separate words.
  3. Removes all numbers. If the address consists only of numbers, then ZERO-ADDRESS is returned.
  4. Removes all garbage words (such as ste, blvd, ave). If the address consists only of garbage words, then all garbage words are preserved. Single character words are ignored. If there are no non-garbage words then the words are used as is (after step 3).
  5. Sorts remaining words alphabetically.
For example:
  • 123 John Kennedy street Ste 12 results in JN-KNT for fuzzy matching and john-kennedy for exact matching.
  • 123 street prkwy 12 results in PRK-STRT for fuzzy matching and prkwy-street for exact matching.
  • 123 12 results in ZERO-ADDRESS for both fuzzy matching and exact matching.
  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of "Exact" in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, and notExactSame.
  • Intended for use with comparator class: MetaphoneComparator. However, in general the values with the same frequency-based tokens but different metaphone tokens will not result in a match. If it is required to compare two values according to letter statistics, then it is better to use the DistinctWordsComparator class with the thresholdChars parameter.
  • Supports non-latin character sets.
1
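The AddressLineMatchToken steps for exact matching can be sketched in Python as follows. This is a minimal sketch, not the product code: the GARBAGE_WORDS set is a small hypothetical stand-in for the real garbage-word dictionary, the function name is invented, and the metaphone encoding used for fuzzy matching is left out.

```python
# Hypothetical garbage-word list; the real dictionary is much larger.
GARBAGE_WORDS = {"ste", "blvd", "ave", "street", "prkwy"}


def address_line_exact_token(value: str, garbage=GARBAGE_WORDS) -> str:
    # Step 1: lowercase and drop everything except letters, digits, and whitespace.
    cleaned = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    # Step 2: split into words.
    words = cleaned.split()
    # PO box addresses keep every word in the original order.
    if words[:2] == ["po", "box"]:
        return "-".join(words)
    # Step 3: remove numbers; an address made only of numbers has a fixed token.
    words = [w for w in words if not w.isdigit()]
    if not words:
        return "ZERO-ADDRESS"
    # Step 4: remove garbage and single-character words, unless nothing would remain.
    kept = [w for w in words if w not in garbage and len(w) > 1]
    if not kept:
        kept = words
    # Step 5: sort alphabetically and join.
    return "-".join(sorted(kept))


print(address_line_exact_token("123 John Kennedy street Ste 12"))
print(address_line_exact_token("123 street prkwy 12"))
print(address_line_exact_token("123 12"))
```

With the assumed garbage list, the sketch reproduces the exact-matching results of the three examples above and of the PO box example in step 1.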
OrganizationNameMatchToken Generates tokens by using the following steps:
  • Splits the organization name into multiple words.
  • Removes words such as inc, limited, corp, and so on.
  • Generates one main token consisting of all words (sorted, stemmed, and concatenated with -), including noise words.
  • Generates tokens for each word pair (the words in a pair are sorted, stemmed, and concatenated with -).
  • If the value contains no noise words, tokens are generated for each word (stemmed).
Note: If the attribute is used by the Fuzzy comparator operator, then this token class will generate metaphone codes instead of word values.

Example 1 - Anheuser-Busch InBev. No noise words: anheus-busch-inbev, inbev, anheus-busch, busch-inbev, anheus-inbev, anheus, busch.

Example 2 - International Business Machines. All words are in noise dictionary: busi-intern-machin, busi-intern, busi-machin, intern-machin.

Example 3 - Reltio Connected Customer 360. Connected and Customer are in the noise dictionary (not explicitly, but after stemming), and 360 consists of digits: connect-custom-reltio, reltio.

For International Business Machines, tokens are as mentioned in example 2 above.

Reltio recommends using this token class along with the OrganizationNamesComparator comparator class.

  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of "Exact" in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, and notExactSame.
  • Intended for use with comparator class: MetaphoneComparator. However, in general the values with the same frequency-based tokens but different metaphone tokens will not result in a match. If it is required to compare two values according to letter statistics, then it is better to use the DistinctWordsComparator class with the thresholdChars parameter.
  • Supports non-latin character sets.
N-number of words excluding noise
RangeNumericMatchToken Generates appropriate tokens for a range of numeric values. If the attribute value is identified for an Exact match, this class generates a token formatted to have four digits after the decimal point. If the attribute value is marked for a fuzzy match, this class generates tokens based on the value of the threshold parameter. The value of this parameter can be a number (of type Int, Integer, Number, or Double) or a percentage, for example, 0.25 or 10%. This value indicates the maximum acceptable difference for the comparison to evaluate to True.

Example 1 - RangeNumericMatchToken with fuzzy operand.

Two values are 12.55, 12.9. They should result in same tokens (at least one token should be common). If we set the threshold=5%, then the tokens are:
  • 12.0000, 13.0000 for the first value
  • 12.0000, 13.0000 for the second value
So, the two values match. If the two values are 13.55, 12.9, then the tokens are:
  • 13.0000, 14.0000 for the first value
  • 12.0000, 13.0000 for the second value
So, the two values match. If the two values are 14.55, 12.9, then the tokens are:
  • 14.0000, 15.0000 for the first value
  • 12.0000, 13.0000 for the second value
These two values do not match.
  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
  • If chosen, implements the behavior of Exact in each of the following Comparison Operators: Exact, ExactOrNull, ExactOrAllNull, and notExactSame.
  • Intended for use with comparator class: RangeNumericComparator. Note that when the RangeNumericComparator class is used, all parameters passed to the comparator class are used for the RangeNumericMatchToken class as well.
  • Does not support non-latin character sets.
1 (in case of exact) or 2 (in case of fuzzy)
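The fuzzy example above relies on one idea: two values are match candidates when their token sets share at least one token. The Python sketch below illustrates that check. The bucketing rule (round down and round up to whole numbers) is an assumption inferred only from the example values and ignores the threshold parameter; the real class derives its tokens from the configured threshold. All function names are hypothetical.

```python
import math


def exact_range_token(value: float) -> str:
    """Exact case: one token with four digits after the decimal point."""
    return f"{value:.4f}"


def fuzzy_range_tokens(value: float) -> set:
    """Fuzzy case (assumed rule): tokens for the whole numbers around the value."""
    return {f"{math.floor(value):.4f}", f"{math.ceil(value):.4f}"}


def are_candidates(a: float, b: float) -> bool:
    """Two values are candidates if they share at least one token."""
    return bool(fuzzy_range_tokens(a) & fuzzy_range_tokens(b))


print(sorted(fuzzy_range_tokens(12.55)), sorted(fuzzy_range_tokens(12.9)))
print(are_candidates(12.55, 12.9), are_candidates(13.55, 12.9), are_candidates(14.55, 12.9))
```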
BasicTokenizedOrganizationNameMatchToken Same as the OrganizationNameMatchToken class, except that the BasicTokenizedOrganizationNameMatchToken class does not use a stemmer, has a small set of delimiters, and has a short noise dictionary.

Reltio recommends using this token class along with the BasicTokenizedOrganizationNameComparator comparator class.

CustomMatchToken This match token class is like a container for unlimited groups of other match token classes. Each group is configured using match token parameters that are also grouped. Each parameter group contains a list of parameters and these parameter groups must be configured as a list of group elements. The following parameters are supported:
  • "className": match token class name which is used to generate tokens for group. Default value is ExactMatchToken.
  • "classParams": parameters (if any) for the match token class.
  • "pattern": regular expression pattern used to match and extract group specific values from the original attribute values. If this parameter is not specified, the original attribute value is used as group specific. Default value is "\\S+".
  • "noiseDictionary": name of a predefined noise words dictionary or the URL of a custom file with noise dictionary words. A noise dictionary contains words that are excluded from the attribute values before generating the tokens. Possible predefined values:
    • addressLine
    • organizationName
Example of a URL to the custom dictionary file: https://s3.amazonaws.com/test.api.tmp.data/noiseWords.txt
Note: To update an already loaded dictionary, you must restart the API Server. Therefore, Reltio recommends using a different file name so that the URL points to the updated file. A custom dictionary file is limited to 10M characters. The file size on disk correlates with this limit but does not map to it exactly: spaces are trimmed and not counted, and most characters take only one byte. For example, address-line-garbage.txt has 9937 characters and 1734 line breaks, and occupies 11,671 bytes on disk. If the file exceeds the limit, the remaining portion of the dictionary is skipped during load.
  • useNoiseIfEmpty: If enabled and the value contains noise words only, then no noise words are removed from the value and tokens are generated for the noise words. Default is true.
  • "useStemmer": If enabled, the words are stemmed to their base form. Default is "false".
  • useSoundex: If enabled, the words are replaced by their soundex codes. Default is false.
  • wordDelimeter: The symbol that joins words inside a group before the value is passed to the specified match token. The default value is whitespace.
  • sortWords: If enabled, words inside a group are sorted alphabetically before the value is passed to the specified match token. Otherwise, the original order of words is retained. Default is true.

The algorithm parses through all configured groups, and for each group it performs the following:

  1. Splits the value into words according to the specified regexp pattern.
  2. Constructs a list of lowercase words.
  3. Stems words (if enabled).
  4. Removes noise words (if a noise dictionary is specified).
  5. Replaces words with soundex codes (if enabled).
  6. Sorts words (if enabled) and joins them into one value by using the word delimiter.

The final token list is constructed by adding generated tokens from all the groups. In other words, the final token is a join of single tokens from each of the groups separated by the colon (':') character.
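The per-group processing and the final colon-separated join can be sketched in Python as follows. This is a simplified sketch under stated assumptions: the stemmer and soundex steps are omitted, the step that passes each group value to the class named in className is skipped (the group value is used directly), and the Python function and keyword names are hypothetical stand-ins for the documented parameters.

```python
import re


def group_value(value, pattern=r"\S+", noise=frozenset(), sort_words=True,
                word_delimiter=" ", use_noise_if_empty=True):
    """Build the value of one group (stemming and soundex omitted)."""
    # Steps 1-2: extract words with the pattern and lowercase them.
    words = [w.lower() for w in re.findall(pattern, value)]
    # Step 4: remove noise words, keeping them if nothing else would remain.
    kept = [w for w in words if w not in noise]
    if not kept and use_noise_if_empty:
        kept = words
    # Step 6: sort (if enabled) and join with the word delimiter.
    if sort_words:
        kept = sorted(kept)
    return word_delimiter.join(kept)


def custom_match_token(value, groups):
    """Join the single token of every group with the colon character."""
    return ":".join(group_value(value, **group) for group in groups)


# Two hypothetical groups: all words minus noise, and the first word only.
groups = [{"noise": {"inc"}}, {"pattern": r"^\S+"}]
print(custom_match_token("Acme Holdings Inc", groups))
```

Here the first group yields acme holdings and the second yields acme, so the combined token is acme holdings:acme.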

DistinctWordsMatchToken This match token class generates tokens by distinct words based on the parameters specified. The following parameters are supported:
  • "pattern": regular expression pattern used to match and extract distinct words from the attribute value. Default value is "\\W+". This pattern is to split the value to words by non-word characters.
  • "threshold": the number of words for which to generate a token. The threshold value can be an absolute value or a percentage of words from an actual value. Default value is "50%". For example, the following tokens are generated for the value aa bb cc:
    • if threshold=2, then the tokens generated are for: aa-bb, aa-cc, bb-cc
    • if threshold=50%, then the tokens generated are for: aa-bb, aa-cc, bb-cc, aa-bb-cc
  • "noiseDictionary": name of a predefined noise words dictionary or the URL of a custom file with noise dictionary words. A noise dictionary contains words that are excluded from the attribute values before generating the tokens. Possible predefined values:
    • addressLine
    • organizationName
    • internationalOrganizationName
Note: To update an already loaded dictionary, you must restart the API Server. Therefore, Reltio recommends using a different file name so that the URL points to the updated file. Note that the custom dictionary file size is limited to 20MB. If the size exceeds this limit, the remaining portion of the dictionary is skipped during load.
Each value can generate at most 1000 tokens. If the number of tokens for subsets of a given size is greater than the remaining capacity, subsets of that size are ignored. For example, with threshold=50% and a value containing 55 words, tokens are generated for the subsets of 55 and 54 words, and subsets of 53 words or fewer are ignored.
  • useNoiseIfEmpty: If enabled and the value contains noise words only, then no noise words are removed from the value and tokens are generated for the noise words. Default is true.
  • "useStemmer": If enabled, the words are stemmed to their base form. Default is "false".
  • useSoundex: If enabled, the words are replaced by their soundex codes. Default is false.
  • sortWords: If enabled, words inside a group are sorted alphabetically before the value is passed to the specified match token. Otherwise, the original order of words is retained. Default is true.

The algorithm does the following:

  1. Splits the value into distinct words according to the specified regexp pattern.
  2. Constructs a list of lowercase words.
  3. Stems words (if enabled).
  4. Removes noise words (if a noise dictionary is specified).
  5. Replaces words with soundex codes (if enabled).
  6. Generates tokens for the subsets of words with a size equal to the threshold. If the threshold is a percentage, additional tokens are generated for subsets larger than the threshold.
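The subset generation and the 1000-token cap can be sketched in Python as follows. Several details are assumptions rather than documented behavior: a percentage threshold is rounded up to a whole number of words, subsets are generated from the largest size downward, words are sorted and joined with -, and the stemmer, soundex, and noise dictionary steps are omitted. The function name is hypothetical.

```python
import re
from itertools import combinations
from math import ceil, comb

MAX_TOKENS = 1000  # documented cap on tokens per value


def distinct_words_tokens(value, threshold="50%", pattern=r"\W+"):
    # Split on the pattern, lowercase, and keep distinct words in sorted order.
    words = sorted({w.lower() for w in re.split(pattern, value) if w})
    n = len(words)
    if isinstance(threshold, str) and threshold.endswith("%"):
        # Percentage: subsets from all words down to the rounded-up minimum size.
        min_size = max(1, ceil(n * float(threshold[:-1]) / 100))
        sizes = range(n, min_size - 1, -1)
    else:
        # Absolute value: only subsets of exactly that size.
        sizes = [int(threshold)]
    tokens = []
    for size in sizes:
        # Skip a subset size whose tokens do not fit into the remaining capacity.
        if comb(n, size) > MAX_TOKENS - len(tokens):
            continue
        tokens.extend("-".join(subset) for subset in combinations(words, size))
    return tokens


print(distinct_words_tokens("aa bb cc", threshold=2))
print(distinct_words_tokens("aa bb cc", threshold="50%"))
```

Under these assumptions the sketch reproduces the aa bb cc examples, and for a value with 55 distinct words and threshold=50% it produces 56 tokens (1 for the 55-word subset and 55 for the 54-word subsets), which agrees with the 55-word example given with the token limit.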