The complete guide to string similarity algorithms
Introduction
In Natural Language Processing, we often need to compare strings, which can be words, sentences, paragraphs, or even whole documents. One approach is to build a model that takes a string as input and outputs a representation of that word or sentence (called an embedding). This solution works well, especially when we want to account for both syntax and meaning. However, building such a model can be challenging, and we are often short on time. In those cases, we need a quick way to tell whether two words or sentences are similar, and how similar they are. A familiar example is typing a misspelled word on a phone and seeing the keyboard suggest the right words to fix it. So, what are the most effective methods to meet these needs, and how do they work? This article explores these questions.
Definition: String similarity
When we have two numbers, we can compare them easily by subtracting one from the other and looking at the sign and magnitude of the result. The same idea extends to vectors, where there are many ways to measure the difference. For example, we can calculate the cosine distance, the Euclidean distance, the Manhattan distance, or, more generally, a distance of order p using the Minkowski formula: