GO Online Toolset
home
search
feedback

Text similarity
955  |   |   |  6

Algorithm Introduction

Introduction to Jaccard Algorithm (for long text)

1.Jaccard similarity, also known as Jaccard index or Jaccard coefficient, is a statistic used to measure the similarity of two sets (** can be divided by ngram **). It is defined as the ratio between the intersection and the union of two sets. It is mainly used to calculate the similarity between sets, and is widely used in information retrieval/text analysis/image processing and other fields.
2. In particular, given two sets A and 𝐵, Jaccard similarity 𝐽 (𝐴, 𝐵) is defined as:

J(A,B)=AB/A u B
// ∣A∩B∣ denote the number of intersection elements of sets A and B.
// ∣AuB∣ denotes the number of elements of the union of sets A and B.
  1. The Jaccard similarity ranges from 0 to 1:
// when 𝐽(𝐴,𝐵) = 1, the two sets are exactly the same
// when 𝐽(𝐴,𝐵) = 0, the two sets are completely different

Introduction to Levenshtein algorithm (suitable for short text)

1.Levenshtein distance, also known as edit distance, is a metric used to measure the similarity between two strings. It is defined as the minimum number of editing operations required to convert one string to another. The allowed editing operations include inserting a character/deleting a character/replacing a character.
2. Treatment is as follows:

// Suppose there are two strings:
s1 = "kitten"
s2 = "sitting"
// Calculate their Levenshtein distance:

// replace "k" with "s": kitten ->; sitten (1 operation)
// replace "e" with "i": sitten ->; sittin (1 operation)
// Insert "g" at the end: sittin ->; sitting (1 operation)
// Thus, the Levenshtein distance
Levenshtein("kitten","sitting")=3

// similarity (maxlen-Levenshtein)/maxLen
maxLen=max(s1.length, s2.length)
res=(maxLen-3)/maxLen