First step of Customer Delight Journey – Data Deduplication

We see our customers as guests at a party, and we are the hosts. It’s our daily job to make every important aspect of the customer experience a little bit better.
– Jeff Bezos

Duplicate customer data is a common cause of a host of issues, and in some cases it leads directly to unsatisfied customers.

Let’s take a call center as an example. An outbound caller may call a customer, offering them a particular product. The customer declines, and the caller logs this in the CRM system. However, the business has two records for the same customer, perhaps due to a misspelled name. Another salesperson then calls the same customer using the other record and makes the same offer a second time, which makes for a bad customer experience.

It’s easy to see how much of a problem customer data duplication can be. So, how can companies effectively overcome the issue?

Data Deduplication

Data deduplication refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted or linked, leaving only one copy of the data to be stored. Fuzzy matching is one of the ways to achieve data deduplication.

What is Fuzzy Matching?

Fuzzy matching, also known as approximate matching and often used in probabilistic record linkage, determines how similar two data elements are rather than whether they are identical. What makes fuzzy matching different from traditional database searching is that results for a query are returned based on likely relevance. Search words and spellings do not have to match database records exactly in order to yield results, so matches may include alternate spellings of a search term. In effect, fuzzy matching estimates how likely it is that different records or search terms refer to the same thing.
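
To make the idea concrete, here is a minimal Python sketch of fuzzy matching built on the standard library's difflib.SequenceMatcher. It is only an illustration: the helper names similarity and fuzzy_search, the sample names, and the 0.8 cutoff are my own assumptions, and the score is a string-similarity ratio rather than a true probability.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score from 0.0 to 1.0, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_search(query, candidates, threshold=0.8):
    """Return (candidate, score) pairs that meet the threshold, best match first."""
    scored = [(candidate, similarity(query, candidate)) for candidate in candidates]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, for illustration only.
for name, score in fuzzy_search("Jon Smith", ["John Smith", "Mary Jones", "Jon Smyth"]):
    print(f"{name}: {score:.2f}")
```

An exact-match query for "Jon Smith" would return nothing here, while the fuzzy search still surfaces the two near-identical spellings and leaves out the unrelated name.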

The idea behind data deduplication using fuzzy matching is to find records that are approximately the same rather than requiring an exact match. For example, a good customer deduplication algorithm should be able to identify that records differing only by a misspelling, a nickname, or an abbreviation belong to the same customer.

While this may sound like a complex effort, it is actually quite manageable, thanks to the fuzzy matching support available in multiple technologies. Having worked on many customer analytics implementations, I know that deciding on the right approach for deduplication is key to a successful customer strategy. In this series of blogs, I will write about various strategies to achieve customer deduplication.

Fuzzy Lookup/Grouping functionality is built into Microsoft SQL Server Integration Services (SSIS).

Fuzzy Lookup allows SSIS to inspect a set of data and compare one or more fields in the dataset. It matches strings approximately, using a similarity measure in the spirit of edit distances such as Levenshtein distance, which gives more useful results in the face of misspellings, typos, abbreviations, nicknames, etc.
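
For readers who have not met Levenshtein distance before, the following Python sketch shows the textbook version: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. This is my own illustration of the distance itself, not a description of what SSIS does internally, and the levenshtein_similarity normalization is an assumption I made to get a score between 0 and 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to 0.0..1.0, where 1.0 means identical (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("Smith", "Smyth"))
print(levenshtein_similarity("Smith", "smyth"))
```

A single typo costs one edit, so short misspellings score high. Nicknames that share few letters with the full name are a different story, which is one reason real tools do more than a plain edit distance.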

Typically, customer data can be deduplicated based on fields such as first name, last name, date of birth, street address, phone, SSN, and city.
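
One way to combine several fields into a single record-level score is a weighted average of per-field similarities. The Python sketch below shows the idea under loud assumptions: the field names, the weights in FIELD_WEIGHTS, and the sample records are all hypothetical, and difflib stands in for whatever similarity measure your tool of choice uses.

```python
from difflib import SequenceMatcher

# Hypothetical weights (they sum to 1.0); tune them for your own data.
FIELD_WEIGHTS = {
    "first_name": 0.15, "last_name": 0.20, "date_of_birth": 0.20,
    "street_address": 0.15, "phone": 0.10, "ssn": 0.15, "city": 0.05,
}


def normalize(value) -> str:
    """Lowercase the value and collapse runs of whitespace; None becomes ''."""
    return " ".join(str(value or "").lower().split())


def field_similarity(a, b):
    """Return a 0.0..1.0 score, or None when either value is missing."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return None  # a missing value is treated as no evidence either way
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(rec_a: dict, rec_b: dict, weights=FIELD_WEIGHTS) -> float:
    """Weighted average of field scores over the fields present in both records."""
    total = weight_sum = 0.0
    for field, weight in weights.items():
        score = field_similarity(rec_a.get(field), rec_b.get(field))
        if score is None:
            continue
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Hypothetical records, for illustration only.
a = {"first_name": "Jon", "last_name": "Smith", "date_of_birth": "1980-01-15", "city": "Austin"}
b = {"first_name": "John", "last_name": "Smith", "date_of_birth": "1980-01-15", "city": "Austin"}
print(round(record_similarity(a, b), 2))
```

Skipping missing fields, rather than scoring them as zero, keeps sparse records from being penalized just for being incomplete. Whether that is the right call depends on how much you trust the remaining fields.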

Configuring Fuzzy Lookup is a pretty straightforward process. Select the reference table and the fields you want to compare, then set the similarity threshold. This can involve a little trial and error while you fine-tune Fuzzy Lookup so that it identifies potential duplicates without letting through false positives.
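
The trial and error around the similarity threshold can be made a bit more systematic if you have a small sample of pairs that a person has already reviewed. The Python sketch below is one way to do that, not part of SSIS: evaluate_threshold and the labeled sample are hypothetical, and each pair is just a similarity score plus a reviewer's verdict.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Summarize what a threshold would flag.

    scored_pairs is a list of (similarity, is_true_duplicate) tuples.
    """
    flagged = [(score, dup) for score, dup in scored_pairs if score >= threshold]
    false_positives = sum(1 for _, dup in flagged if not dup)
    missed = sum(1 for score, dup in scored_pairs if dup and score < threshold)
    return {
        "threshold": threshold,
        "flagged": len(flagged),
        "true_positives": len(flagged) - false_positives,
        "false_positives": false_positives,
        "missed_duplicates": missed,
    }


# Hypothetical reviewed sample: (similarity score, reviewer says duplicate?)
sample = [
    (0.97, True), (0.91, True), (0.84, True), (0.82, False),
    (0.76, True), (0.70, False), (0.55, False),
]

for threshold in (0.9, 0.8, 0.7):
    print(evaluate_threshold(sample, threshold))
```

On this made-up sample, a strict threshold flags no false positives but misses real duplicates, while a loose one catches every duplicate at the cost of more records to review by hand. The right balance depends on how expensive each kind of mistake is for your business.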

Fuzzy Lookup evaluates the customer records and compares them based on the selected fields. Potentially duplicate records are then assigned similarity scores.

After Fuzzy Lookup, a step that checks the similarity scores and group count determines which records are potential duplicates. These are exported to an Excel file for review, which you can open once the process is complete. There, you can see the unique record number assigned to each record, the master record number, the percentage match to the potential duplicate record, and the percentage match for each compared field. The highlighted values in the screenshot show some of the values that were compared and demonstrate how Fuzzy Lookup can identify potential duplicates despite misspellings, nicknames, and partial matches.
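
Outside of SSIS, the same grouping-and-review step can be approximated in a few lines. The Python sketch below is an assumption-heavy stand-in rather than the SSIS package itself: group_duplicates, review_csv, the column names, the 0.8 threshold, and the choice of the lowest record number as the master are all mine, and it produces CSV text (which Excel can open) instead of a native Excel file.

```python
import csv
import io
from collections import Counter


def group_duplicates(record_ids, matches, threshold=0.8):
    """Map each record id to a master id by linking pairs at or above threshold.

    matches is a list of (id_a, id_b, similarity) tuples.
    """
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links up to the group's root, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for id_a, id_b, score in matches:
        if score >= threshold:
            root_a, root_b = find(id_a), find(id_b)
            if root_a != root_b:
                # The lowest id becomes the master (an arbitrary rule for this sketch).
                master, other = sorted((root_a, root_b))
                parent[other] = master
    return {rid: find(rid) for rid in record_ids}


def review_csv(record_ids, matches, threshold=0.8) -> str:
    """Return CSV text listing every record that shares a group with another record."""
    masters = group_duplicates(record_ids, matches, threshold)
    group_counts = Counter(masters.values())
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["record_id", "master_id", "group_count"])
    for rid in record_ids:
        master = masters[rid]
        if group_counts[master] > 1:  # groups of one are not potential duplicates
            writer.writerow([rid, master, group_counts[master]])
    return buffer.getvalue()


# Hypothetical match results, for illustration only.
print(review_csv([1, 2, 3, 4, 5], [(1, 2, 0.93), (3, 2, 0.88), (4, 5, 0.42)]))
```

Note that grouping is transitive here: if record 1 matches 2 and 2 matches 3, all three land in one group even if 1 and 3 were never compared directly. That is convenient, but it is also why a human review step is worth keeping.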

In my next blog, I will demonstrate the use of Big Data technologies to perform customer deduplication.

“Information is not knowledge,” Albert Einstein once said. Being a genius, he could state that truth plainly. The rest of us can strive to turn information into knowledge by ensuring the quality of our data!