Remove Noise Words

Noise words are generic words, specific to an entity type, that must be removed for better matching performance.

Noise words (also called garbage words) are words that occur so commonly in an attribute that they dilute the effectiveness of the more meaningful values in the match process. For Organizations, example noise words are Corp, LLC, and Inc. For Addresses, example noise words are St, Street, Avenue, and Ave. It is often desirable to ignore these words when generating tokens and comparing values.
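To illustrate the idea, here is a minimal Python sketch of noise word removal before comparison. It is not Reltio's implementation: the function names, the tokenization rule (lower-case runs of letters and digits), and the short word list (taken from the Organization examples above) are assumptions made for illustration only.

```python
import re

# Hypothetical noise-word list, taken from the Organization examples above.
ORG_NOISE_WORDS = {"corp", "llc", "inc"}


def strip_noise_words(value, noise_words):
    """Return the lower-cased tokens of `value` with noise words dropped."""
    # Assumed tokenization: runs of letters and digits, case-insensitive.
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return [token for token in tokens if token not in noise_words]


def same_after_noise_removal(a, b, noise_words):
    """Compare two values using only their meaningful (non-noise) tokens."""
    return strip_noise_words(a, noise_words) == strip_noise_words(b, noise_words)
```

With this sketch, "Acme Corp." and "ACME, Inc" both reduce to the single token acme and compare as equal, while "Acme LLC" and "Apex LLC" do not, because the shared word LLC no longer contributes to the comparison.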

This capability is not available as a standalone cleanser; it can only be invoked within the context of a comparator and a match token class. For your convenience, Reltio provides an out-of-the-box noise word removal function and a preset list of noise words for Organizations and for Addresses. The function is built into the BasicTokenizedOrganizationNameComparator and AddressLineComparator classes and their companion match token classes. Each of these classes uses a built-in list of noise words that you can download.

To develop and use your own list of noise words:

  1. Create a text file (for example, myNoiseWords.txt) that looks like this:
    • inc
    • co
    • corp
    • corps
    • corporation
    • corporate
    • company
    • service
    • services
    The list is applied in a case-insensitive manner, but as a best practice, enter all words in lower case.
  2. Submit the list to Reltio by filing a support ticket at support@reltio.com. Attach your text file and request the task "Add file for noise words removal". You will receive the full path name of the file.
  3. Create a custom comparator class and specify the full path and file name in the appropriate parameter field of the custom class. For more information, see Comparator Classes.
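For step 1, the following Python sketch shows one way to prepare the file before you submit it. It assumes the file holds one word per line, which is how the example list above reads but is not confirmed here. The helper names are hypothetical and are not part of any Reltio API.

```python
from pathlib import Path


def normalize_noise_words(words):
    """Lower-case, trim, and de-duplicate words, keeping their original order."""
    seen = set()
    cleaned = []
    for word in words:
        normalized = word.strip().lower()
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


def write_noise_words(words, path):
    """Write the normalized list to `path`, one word per line (assumed format)."""
    cleaned = normalize_noise_words(words)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

For example, calling write_noise_words(["Inc", "inc", "Corp"], "myNoiseWords.txt") writes a two-line file containing inc and corp, following the lower-case best practice from step 1.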