zaro

What is the Jaccard Similarity of Two Words?

Published in Text Similarity 4 mins read

The Jaccard similarity of two words measures the overlap between them by treating each word as a set of elements, most commonly characters or character sequences (n-grams). It calculates the ratio of the number of common elements to the total number of unique elements present in both words. This method quantifies how similar two words are based on their shared components.

Understanding Jaccard Similarity

Jaccard similarity, also known as the Jaccard index or Intersection over Union, is a statistic used for gauging the similarity and diversity of sample sets. For two sets, A and B, the Jaccard similarity is defined as the size of the intersection divided by the size of the union of the sets.

Mathematically, the Jaccard similarity ($J$) between two sets A and B is:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Where:

  • $|A \cap B|$ represents the number of elements common to both set A and set B (the intersection).
  • $|A \cup B|$ represents the total number of unique elements in either set A or set B (the union).

The score ranges from 0 to 1, where:

  • 0 indicates no common elements (completely dissimilar).
  • 1 indicates identical sets (completely similar).

While the general definition of Jaccard similarity applies to "sets of words" (e.g., comparing two documents based on their vocabulary), when comparing two individual words, they must first be converted into sets of smaller units for the calculation to be meaningful.

Applying Jaccard Similarity to Individual Words

When the question refers to the Jaccard similarity of two individual words, it typically implies converting these words into sets of their constituent characters or n-grams.

1. Jaccard Similarity of Character Sets

This is the most straightforward way to apply Jaccard similarity to individual words. Each word is treated as a set of its unique characters.

Example: Calculating the Jaccard similarity between "apple" and "apply".

Word Set of Unique Characters
apple $A = {'a', 'p', 'l', 'e'}$
apply $B = {'a', 'p', 'l', 'y'}$
  1. Intersection ($A \cap B$): The common characters are 'a', 'p', 'l'.
    $|A \cap B| = 3$
  2. Union ($A \cup B$): All unique characters from both sets are 'a', 'p', 'l', 'e', 'y'.
    $|A \cup B| = 5$
  3. Jaccard Similarity:
    $J(\text{apple}, \text{apply}) = \frac{|A \cap B|}{|A \cup B|} = \frac{3}{5} = 0.6$

2. Jaccard Similarity of N-gram Sets

For a more nuanced comparison, words can be broken down into n-grams, which are contiguous sequences of n characters. Common choices include bigrams (n=2) or trigrams (n=3). This approach can better capture the order and local structure of words.

Example: Calculating the Jaccard similarity between "fruit" and "flute" using bigrams (n=2).

Word Bigrams Set of Unique Bigrams
fruit fr, ru, ui, it $A = {\text{'fr', 'ru', 'ui', 'it'}}$
flute fl, lu, ut, te $B = {\text{'fl', 'lu', 'ut', 'te'}}$
  1. Intersection ($A \cap B$): There are no common bigrams.
    $|A \cap B| = 0$
  2. Union ($A \cup B$): All unique bigrams from both sets.
    $|A \cup B| = 8$
  3. Jaccard Similarity:
    $J(\text{fruit}, \text{flute}) = \frac{|A \cap B|}{|A \cup B|} = \frac{0}{8} = 0$

This example shows that even for somewhat visually similar words, if their n-gram composition differs entirely, the Jaccard similarity will be zero. This highlights the importance of choosing the right granularity (character vs. n-gram, and the n-gram size) for your comparison.

Practical Applications

Jaccard similarity, whether applied to individual words (via character/n-gram sets) or larger sets of words (documents), has various applications in text analysis and data science:

  • Spelling Correction: Identifying words that are phonetically or structurally similar to a misspelled word.
  • Plagiarism Detection: Comparing documents or code snippets to find identical or highly similar sections.
  • Document Similarity: Measuring how related two documents are based on their shared vocabulary.
  • Bioinformatics: Comparing DNA sequences by treating them as sequences of characters.
  • Recommendation Systems: Finding similar items based on shared attributes.
  • Search Engines: Ranking search results based on the similarity between queries and document content.

For further reading on Jaccard similarity and its applications in text analysis, you can refer to resources like Wikipedia's Jaccard Index page.