Text-based similarity is not a magic bullet for data matching


When it comes to data matching without a unique identifier, text-based similarity is widely used. Comparing texts and finding out which ones are similar serves as a guideline for making matching decisions.

Text-based similarity can be defined in various ways, e.g. by counting the number of common letters or by counting how many changes are required to transform one term into the other. All of these approaches have their strengths and weaknesses depending on the type of text to compare (e.g. single words, whole sentences, technical names). Most of these (distance) measures count how many changes (operations) are required to transform one text into the other. The decision for or against a match is then based on a threshold value. This is a fair approach, but finding the right threshold is not easy. Additionally, this method lacks the human judgment that might work even better, depending on the data.

Text-based similarity vs context-based similarity

A few examples illustrate when text-based similarity gets beaten by context-driven similarity. They involve John Adams and his son John Quincy Adams, the 2nd and 6th presidents of the United States of America.

The point here is not to discredit text-based similarity. The purpose is to show that human brain power has its place in the data matching landscape as long as AIs are not smart enough for general tasks such as context-based similarity.


For the following overview, the Levenshtein Distance is used. It describes the number of changes (i.e. delete, insert, replace) required to transform one string into the other. The Levenshtein Similarity is computed as 1 minus the Levenshtein Distance divided by the length of the longer term. Accordingly, the maximum Levenshtein Similarity of 1 describes equality and the minimum Levenshtein Similarity of 0 describes total difference.
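Written as a formula, the definition could look like the sketch below. The symbols are notation introduced here for illustration, not taken from the original text: lev(s, t) stands for the Levenshtein Distance and |s| for the length of a term. The second line works through the pair "John Adams" (10 characters) and "John Q. Adams" (13 characters), where the three characters "Q. " have to be inserted. The block assumes the amsmath package for \operatorname and \text.

```latex
\[
\operatorname{sim}(s, t) = 1 - \frac{\operatorname{lev}(s, t)}{\max(|s|, |t|)}
\]
\[
\operatorname{sim}(\text{John Adams}, \text{John Q. Adams}) = 1 - \frac{3}{13} \approx 0.77
\]
```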
Data matching comparison using Levenshtein Similarity and a matching threshold of 0.7

| No. | Term 1     | Term 2        | Distance | Similarity | Decision | Human check                  |
|-----|------------|---------------|----------|------------|----------|------------------------------|
| 1   | John Adams | Adams, John   | 11       | 0.00       | No Match | Wrong, the same person!      |
| 2   | John Adams | J. Adams      | 3        | 0.70       | Match    | Maybe the same person!       |
| 3   | John Adams | John Q. Adams | 3        | 0.77       | Match    | Wrong, it's father and son!  |
| 4   | John Adams | POTUS No. 2   | 10       | 0.09       | No Match | Wrong, the same person!      |

* POTUS = President of the United States
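
The comparison above can be reproduced with a short script. The following is a minimal sketch, not the tool's actual implementation: the function names `levenshtein_distance` and `decide` are made up for this example, the comparison is assumed to be case-sensitive, and a pair is assumed to count as a match when its similarity is greater than or equal to the threshold. Exact fractions are used instead of floating-point numbers so that a similarity sitting exactly on the threshold (as in the "J. Adams" pair) is decided reliably.

```python
from fractions import Fraction


def levenshtein_distance(a: str, b: str) -> int:
    """Count the single-character inserts, deletes and replacements
    needed to turn a into b (case-sensitive, row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletes
        for j, char_b in enumerate(b, start=1):
            replace_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                 # delete char_a
                current[j - 1] + 1,              # insert char_b
                previous[j - 1] + replace_cost,  # replace or keep
            ))
        previous = current
    return previous[-1]


def decide(term_1: str, term_2: str, threshold: Fraction = Fraction(7, 10)):
    """Return (distance, similarity, decision) for one pair of terms.
    Assumption: similarity >= threshold counts as a match."""
    distance = levenshtein_distance(term_1, term_2)
    longer = max(len(term_1), len(term_2))
    similarity = Fraction(1) if longer == 0 else 1 - Fraction(distance, longer)
    decision = "Match" if similarity >= threshold else "No Match"
    return distance, similarity, decision


if __name__ == "__main__":
    pairs = [
        ("John Adams", "Adams, John"),
        ("John Adams", "J. Adams"),
        ("John Adams", "John Q. Adams"),
        ("John Adams", "POTUS No. 2"),
    ]
    for term_1, term_2 in pairs:
        distance, similarity, decision = decide(term_1, term_2)
        print(f"{term_1!r} vs {term_2!r}: distance={distance}, "
              f"similarity={float(similarity):.2f}, decision={decision}")
```

Passing a different `threshold` to `decide` shows how sensitive the outcome is: with a threshold of 0.8, for example, neither "J. Adams" nor "John Q. Adams" would be matched to "John Adams", while no threshold rescues the "Adams, John" pair, whose similarity is 0.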

Matchmerize is a tool for humans to create context-driven data matchings more efficiently

Not all data matching tasks benefit from context-driven matching the way the example above does. But there are cases where no AI (at least not yet) and no text-distance-driven algorithm exceeds what a human brain can achieve.

To make the most of human brain power, a capable tool is required. Handling matchings can get confusing quickly, and there is an inherent risk of messing up the data. This is typically the issue when people want or need to match data themselves and then struggle to do it efficiently. One thing is clear: spreadsheet software is not sufficient for data matching, neither for text-based nor for context-based approaches.
Learn why data matching gets confusing quickly and what to do about it