--

Introduction

In the field of Natural Language Processing, people often come across situations where they need to compare strings, which can be words, sentences, paragraphs or even documents. One approach is to create a complex model where the input is a string, and the output is a representation of the word or sentence (called embedding). This solution is excellent, especially when we want to consider both the syntax and the meaning. However, building such a model can be quite challenging. Many times, we are short on time for these tasks, and we need a quick way to determine if two words or sentences are similar or have good or poor similarity. This is similar to when we use our phones and type a wrong word, but the phone suggests the right words to fix it. So, what are the most effective methods to meet these needs, and how do they work? This article will explore these questions.

Definition: String similarity

When we have two numbers, we can compare them easily by subtracting one from the other and looking at the sign and how big the result is. This way of comparing can also work with vectors, and there are many ways to do this. For example, we can calculate the cosine distance, the Euclidean distance, the Manhattan distance, or even use a p-distance with the Minkowski formula:

These comparisons made scientists start thinking about how to compare words, sentences, or strings in general. One simple way is to compare the common letters between strings or words. However, there are many methods and algorithms to solve this problem, and each of them is good for specific needs and situations.

In general, there are three main types of algorithms to measure string similarity, which we will discuss in the next section:

• Edit-based algorithms
• Token-based algorithms
• Sequence-based algorithms

Edit based algorithms

Edit-based algorithms, also known as distance-based algorithms, measure the minimum number of single-character operations (insertions, deletions, or substitutions) required to transform one string into another. The more operations we’ll have the less (greater) the similarity (distance) will be. This metric serves as a foundation for many other string similarity techniques and is widely used in spell checking, autocorrection, and DNA sequence analysis.

Note: every character and every operation here has the same importance

The following are some commonly used algorithms for this section

Hamming distance

This metric is a distance to measure the dissimilarity between two strings of the same length by overlaying one string over another and count how many positions have different characters. Some libraries can ignore the length condition but this algorithm is not meant to handle these cases. Bellow some examples:

`>> import textdistance as td>> td.hamming('book', 'look')1>> td.hamming.normalized_similarity('book', 'look')0.75>> td.hamming('bellow', 'below')3>> td.hamming.normalized_similarity('Below', 'Bellow')0.5`

In the first example we have one different character. This makes the distance equals to 1 and the normalized similarity equals to (4–1)/4 = 75%. The second example we compare “bellow” and “below” the first 3 letters are the same but the next 3 letters aren’t. Therefore the distance is 3 and the normalized similarity is (6–3)/6 = 50%

Levenshtein distance

Levenshtein distance, the most used edit based algorithm, is a string similarity metric used to measure the minimum number of single-character allowed operations (insertions, deletions, or substitutions) required to transform one string into another. It provides a quantifiable measure of how different two strings are. It hasn’t the sequence length condition like the hamming distance. Let’s apply it to the examples above.

`>> td.levenshtein('book', 'look')1>> td.levenshtein.normalized_similarity('book', 'look')0.75>> td.levenshtein('bellow', 'below')1>> td.levenshtein.normalized_similarity('Below', 'Bellow')0.84`

In the first example we can replace one letter by another to get to the other word, then the normalized similarity is (4–1)/4 = 75%. In the second example we have one insertion operation, so the distance is 1 and the normalized similarity is (6–1)/6 = 84%.

Damerau-Levenshtein distance

Damerau-Levenshtein distance is a variation of the Levenshtein distance that also includes the transposition operation. It measures the number of the four operations required to transform one string into another. The transposition involves swapping two adjacent characters.

Note: You cannot transpose two symbols that are not successive and hence, the distance between “stop” and “spot” is not 1 but two (two replacements)

`>> td.levenshtein('act', 'cat')2>> td.levenshtein.normalized_similarity('act', 'cat')0.34>> td.damerau_levenshtein('act', 'cat')1>> td.damerau_levenshtein.normalized_similarity('act', 'cat')0.67`

The levenshtein distance have two replacements of both ‘a’ and ‘c’, but since both letters are adjacent we can transpose them with one operation with the damerau-levenshtein algorithm and it gives twice the similarity results. This algorithm is widely used in the NLP field such as spell checking and sequence analysis.

Jaro similarity

First thing to know is that this algorithm is not a distance but only a similarity score between 0 and 1.

Jaro is an algorithm based on the number of matching characters and on transpositions like the Damerau-Levenstein do except we don’t have the adjacency constraint. The method uses an intuitive formula:

Note: Two characters from s1 and s2 respectively, are considered matching only if they are the same and not farther than max(|s1|, |s2|)/2 - 1 characters apart.

So if no matching characters are found, then the two strings are not similar and the Jaro similarity is 0. But if matching characters are found, then we count the number of transpositions. Let’s see three examples.

`>> td.jaro('bellow', 'below')0.94>> td.jaro('simple', 'plesim')0>> td.jaro('jaro', 'ajro')0.92`

In the first example we have 5 matching characters and one insertion (this is not a transposition operation) so the Jaro similarity is 1/3*(5/6+5/5+6/6). The second example we have 0 matching characters because the common characters are not in the range of max(|s1|, |s2|)/2–1. This is why the jaro similarity is 0. The final example we have 4 matching characters and 1 transposition operation between the first and second letter so the Jaro similarity is 1/3 * (4/4+4/4+3/4) = 0.91

Jaro-Winkler similarity

The Jaro winkler is a modification of Jaro similarity. It is designed to give more weight to the common prefix of the strings. This is going to give greater score to strings that have the first l characters in common. Its formula is:

`>> td.jaro("simple", "since")0.7>> t.jaro_winkler("simple", "since")0.76`

As we can see in the example above since the two strings have a prefix of two common letters. The Jaro-Winkler similarity is greater than the Jaro similarity. We have: 0.7 + 0.1*2*(1–0.7) = 0.7 + 0.06 = 0.76

Smith–Waterman similarity

Smith-Waterman is a dynamic programming algorithm used to find the optimal local alignment between two sequences. Unlike the Needleman-Wunsch algorithm, which finds the optimal global alignment, Smith-Waterman algorithm identifies the best matching subsequence within the sequences which makes it more relevant than the Needleman-Wunsch algorithm.

The algorithm assigns a score to each possible local alignment based on a scoring matrix that defines the similarity or dissimilarity between pairs of characters. It looks for regions in the sequences where the score is maximized, indicating the best local match.

The Smith-Waterman algorithm is particularly useful in bioinformatics when identifying similar regions or motifs within biological sequences, such as DNA or protein sequences, rather than looking for similarities throughout the entire sequences. It allows for the detection of local similarities that may be biologically significant.

`>> td.smith_waterman("GATTACA", "GCATGCU")3>> td.smith_waterman("GATTACA", "GCATGCU")0.43`

Token based algorithms

Token-based algorithms focus on comparing strings based on their constituent tokens or words, rather than individual characters (but sometimes tokens are only characters).

Jaccard similarity

Jaccard distance is a measure of similarity between two sets. It quantifies how similar the sets are by comparing their shared elements with their total combined elements. To calculate it, you need to find the size of the intersection (shared elements) divided by the size of the union (all unique elements) of the sets.

Here’s a simple explanation of the Jaccard distance with an example:

`>> td.jaccard('jaccard similarity'.split(), "similarity jaccard".split())1>> td.jaccard('jaccard similarity'.split(), "similarity jaccard jaccard".split())0.66`

We can see that Jaccard similarity do not take into consideration the order the string words.

Note: The implementation in textdistance package takes into consideration repeated elements, but the classical algorithm with the formula above would give 1 in the second example.

Note 2: If you don’t split your sequence to words, the algorithm will split it to characters and apply the Jaccard formula on them.

Sørensen-Dice similarity

Sørensen-Dice similarity or coefficient, is a metric to measure similarity between two sets like the Jaccard similarity do. It is commonly used in various fields, including data analysis, text mining, and image processing.

It is calculated by finding the ratio of twice the number of shared elements (intersection) between the sets to the sum of the sizes of the sets. The formula for Sørensen-Dice similarity is as follows:

`>> td.sorencen('jaccard similarity'.split(), "similarity jaccard".split())1>> td.sorencen('jaccard similarity'.split(), "similarity jaccard jaccard".split())0.8`

Sørensen-Dice similarity has the same remarks and notes with the Jaccard similarity, the only difference between this algorithm and the Jaccard algorithm that the former emphasizes more the shared elements.

Sørensen-Dice similarity is particularly useful when dealing with binary data or situations where presence or absence of elements is important. It provides a measure of similarity that takes into account the size of the sets and emphasizes the shared elements.

The relation ship between Jaccard and Sorensen-Dice is:

Tversky similarity

The Tversky index is a similarity algorithm measure that quantifies the degree of overlap between two sets, taking into account both false positives and false negatives. It is particularly useful when dealing with imbalanced data or situations where the presence or absence of elements in the sets has varying importance. So here we have the choice to emphasize the common words (characters) or the non-shared one.

It’s defined by the following formula:

As we have detailed above, we can say that the Tversky index can be seen as a generalization of the Sorensen-Dice and the Jaccard algorithm.

`>> td.sorencen('tversky similarity'.split(), "similarity tversky tversky".split())0.8>> tversky = td.Tversky(ks=(0.5, 0.5))>> tversky('tversky similarity'.split(), "similarity tversky tversky".split())0.8>> td.jaccard('tversky similarity'.split(), "similarity tversky tversky".split())0.67>> tversky = td.Tversky(ks=(1, 1))>> tversky('tversky similarity'.split(), "similarity tversky tversky".split())0.67>> tversky = td.Tversky(ks=(0.2, 0.8))>> tversky('tversky similarity'.split(), "similarity tversky tversky".split())0.74`

As we can see the Tversky index allows you to balance the importance of different types of elements in sets, making it suitable for various applications such as information retrieval, data mining, and medical diagnosis.

Overlap similarity

The overlap coefficient, similarity, or Szymkiewicz–Simpson coefficient, is a simple similarity between two sets. It calculates the ratio of the size of the intersection of the sets to the size of the smaller set. The Overlap Coefficient is particularly useful when dealing with binary data or situations where the presence or absence of elements is important. Its following formula is:

This simple formula means also that if a set is a subset of the other set.

`>> td.overlap('overlap similarity'.split(), "similarity overlap overlap".split())1.0`

Cosine similarity

Cosine similarity is a widely used similarity measure that quantifies the cosine of the angle between two non-zero vectors in a multi-dimensional space. It’s often used to compare the similarity between documents, texts, or other high-dimensional data points. Cosine similarity captures the orientation or direction of the vectors rather than their magnitude.

The formula for cosine similarity between two vectors A and B is as follows:

Where A.B is the dot product between the vector A and B, and ||A|| represents the Euclidean norm of the vector A.

The resulting measure of similarity spans from -1, signifying complete opposition, to 1, indicating absolute identity. A value of 0 signifies orthogonality or decorrelation, while values in between denote varying degrees of similarity or dissimilarity.

In text matching, the attribute vectors A and B usually represent the term frequency vectors of the documents. Cosine similarity serves as a mechanism to standardize document length during comparison. In the context of information retrieval, the cosine similarity between two documents falls within the range of 0 to 1, as term frequencies cannot be negative. This holds true even when utilizing TF-IDF weights. The angle between two term frequency vectors cannot exceed 90°.

`>> td.cosine('cosine'.split(), "similarity".split())0>> td.cosine('cosine sim'.split(), "cosine sim sim".split())0.81`

Note: textdistance uses different computation for cosine similarity which is different than the standard similarity. We recommend using scikit learn cosine similarity which gives 0.94 for the second example:

A =[1, 1] ; B = [1, 2] so A.B = 3 so cos(theta) = 3/sqrt(2*5) = 0.94

N-gram similarity

The n-gram comparison algorithm is a method used to measure the similarity between two strings by analyzing their subsequence of n consecutive characters, called n-grams. N-grams are essentially substrings of length n extracted from a given string. This algorithm is commonly used in text analysis, natural language processing, and similarity comparison tasks. The process of the n-gram comparison algorithm involves the following steps:

N-gram Extraction: Divide each input string into overlapping sequences of n characters.
Counting N-grams: Count the occurrences of each unique n-gram in both strings.
Calculating Similarity: Compare the n-gram counts between the two strings and compute a similarity score. The similarity score can be calculated using various metrics, with Jaccard similarity and cosine similarity being commonly used options.

Note: We will use Jaccard similarity and trigram algorithm for our examples

Let’s take an example comparing between two words with a typo: (trigrasm and trigrams)

`>> trigrams = {"tri", "rig", "igr", "gra", "ram", "ams"}>> trigrasm = {"tri", "rig", "igr", "gra", "ras", "asm"}# there are 8 unique grams in our example so# there are 4 shared words and 4 different ones>> result = 4/80.5`

N-gram comparison is useful for various text-related tasks, such as document similarity analysis, plagiarism detection, and machine translation. It captures local patterns within text and can provide insights into the structural resemblance of strings. The choice of n (the length of n-grams) can impact the level of granularity and sensitivity of the comparison. But this algorithm fails when it comes to short strings and its disadvantage is the fact that this algorithm give more emphasis on different grams than the shared ones.

Sequence based algorithm

Sequence-based algorithms are algorithms that focus more on analyzing and comparing the entire sequence as opposed to token based algorithms where we compare tokens in the sequence. These algorithms prove especially valuable when handling DNA sequences, protein sequences, or natural language sentences. Here are a few illustrative examples:

Ratcliff-Obershelp similarity

The Ratcliff-Obershelp similarity or Gestalt pattern matching is a string similarity measure that focuses on finding the similarity between two strings based on their common substrings. It is particularly useful for comparing strings that have similar structures but may differ in terms of minor modifications, deletions, or insertions. The algorithm assigns a similarity score based on the length of the longest common substring between the two strings.

The Ratcliff-Obershelp algorithm works as follows:

Find Longest Common Substring (LCS): Identify the longest common substring between the two input strings. This is done by finding the common subsequence of characters that appears in the same order in both strings.
Apply the algorithm: We start by removing the LCS part and split them at the same spot. This splits the strings into two sections: one on the left and another on the right of the common part. Then, take the left portions of both strings and apply the process again to find their LCS. Repeat this for the right portions too. Keep doing this recursively until any split part becomes smaller than a certain set value.
Compute Similarity score: Finally, we compute the similarity score by the Sorensen-Dice formula mentioned earlier. This score is twice the number of characters shared, divided by the total number of characters in both strings.

Let’s give a good example to clear the fog about this algorithm comparing between “RO pattern matching” and “RO practice”

The common substrings are “RO p” (4 char), ‘a’ (1 char), ‘t’ (1 char), ‘e’ (1 char). This means the similarity is 2*(4+1+1+1)/ (19 + 11)=0.467

Note: There are two disadvantages of this algorithm, first one it’s its complexity which is O(n³) and the other one is that the algorithm is not commutative which means D(X, Y) != D(Y, X)

We can see the second property in the example above. The first illustration we have D(“RO pattern matching”, “RO practice”). Let’s compute the other way:

The common substrings are “RO p” (4 char), ‘r’ (1 char), ‘a’ (1 char), ‘c’ (1 char), ‘i’ (1 char). This means the similarity is 2*(4+1+1+1+1)/ (19 + 11)=0.53

`>> s1, s2 = "RO PATTERN MATCHING", "RO PRACTICE">> td.ratcliff_obershelp(s1, s2), td.ratcliff_obershelp(s2, s1), len(s1), len(s2)(0.46, 0.53, 19, 11)`

Longest common substring/subsequence similarity

The longest common substring algorithm, is a string similarity algorithm that focuses on finding the longest common substring between two strings. It measures the similarity between the strings by identifying the longest sequence of consecutive characters that they share. The similarity score can be computed by dividing the length of the longest common substring by the length of the longest string.

Unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. So the longest common subsequence will always be greater than the longest common substring. The similarity score is computed like the later.

To understand the concept behind we can see the following examples:

`>> s1, s2 = "RO PATTERN MATCHING", "RO PRACTICE">> td.lcsstr(s1, s2), td.lcsseq(s2, s1), td.lcsseq(s2, s1)('RO P', 'RO PRATC', 'RO PRACI')>> td.lcsstr.normalized_similarity(s1, s2), td.lcsseq.normalized_similarity(s1, s2)(0.21, 0.42)`

In the above example we can see that the longest common substring is always a substring that has consecutive characters, but we don’t have this constraint in the longest common subsequence.

Note: The longest common subsequence of (A, B) is different than (B, A). In fact, we can see in the example above the different results (RO PRATC) and (RO PRACI)

Conclusion

In conclusion, the world of string similarity algorithms opens a fascinating realm of possibilities for various applications in natural language processing, data analysis, and beyond. Throughout this article, we delved into a comprehensive exploration of three fundamental categories of algorithms: edit-based, token-based, and sequence-based. The important points to retain are:

Edit-Based Algorithms:

• Damerau-Levenshtein and Jaro winkler distances quantify character-level similarity through edit operations.
• Ideal for applications emphasizing single-character changes and corrections.
• Valuable for spell checkers, OCR, character accuracy and text auto-correction.

Token-Based Algorithms:

• Jaccard and Sørensen-Dice similarities focus on token sets, disregarding sequence.
• Effective for document clustering, plagiarism detection, and recommendation systems.
• Emphasize presence rather than order, aiding in categorization.
• Can be used for content categorization and document comparison.

Sequence-Based Algorithms:

• Longest Common Subsequence (LCS) and Ratcliff/Obershelp methods preserve sequential order.
• Crucial for DNA sequence alignment, text near-duplicate detection, and version control.
• Offer insights into structural relationships within strings.

As we navigated the intricacies of these algorithmic approaches, it became evident that no single method fits all scenarios. Each category carries its strengths, catering to specific needs and contexts. A good data scientist who is armed with the knowledge of these diverse techniques can harness their combined power to tackle a wide array of challenges in string similarity analysis.