Compare Similarity Between Two Lists in Python

To compare similarity between two lists in Python we can calculate:

  • set intersection
  • cosine similarity
  • etc

Similarity would depend also on the data types of the items. For example:

  • integer
  • float
    • 5.04 vs 5.03
  • string
    • grapefruit vs grape

Let's cover several cases on how to compute similarity between two Python lists or arrays.

We will answer on next questions:

  • python get list difference
  • similarity of two lists in python
  • python - find similarity between two lists
  • find similarity between two words in python

1. Calculate similarity with difflib

Quick and efficient way to compute similarity of two numeric or string lists is done by: difflib.SequenceMatcher(None,s1,s2):

Similarity of numeric lists

import difflib

l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]

sm = difflib.SequenceMatcher(None,l1,l2)
sm.ratio()

which give use:

0.18181818181818182

How does it work?

While testing with:

l3 = [1, 3, 4, 5]
l4 = [2, 3, 4, 5]

sm = difflib.SequenceMatcher(None,l3,l4)
sm.ratio()

result:

0.75

So it seems that it compares element by element and checks if items match.

More information is available in the docs - link in the resources sections:

Return a measure of the sequences’ similarity as a float in the range [0, 1].

Similarity two string lists

To find similarity of lists of words we can use the same method:

l5 = list("abcded")
l6 = list("acdefd")

sm = difflib.SequenceMatcher(None,l5,l6)
sm.ratio()

So for lists:

  • ['a', 'b', 'c', 'd', 'e', 'd']
  • ['a', 'c', 'd', 'e', 'f', 'd']

we get:

0.8333333333333334

2. Set Intersection

We can find the similarity of elements of lists by checking the intersection of their elements. This is helpful when the location of the element doesn't matter:

l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]

intersection = set(l1) & set(l1)

similarity = len(intersection) / (len(l1) + len(l2) - len(intersection))

print(similarity)
  • first we find the intersection
    • {1, 2, 3, 4, 5}
  • then we calculate ratio intersection length and sum of list length
    • 0.8333333333333334
  • we can check also elements not present in the lists:
    • set(l1) - intersection - elements of l1 not present in l2
      • set()
    • set(l2) - intersection -elements of l2 not present in l1
      • {6, 8}

3. Jaccard index in Python

The Jaccard index(Jaccard similarity coefficient), is a statistical method used for finding the similarity and diversity of sample sets.

We can use Jaccard index to calculate similarity of two lists in Python:

l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]

intersection = len(set(l1) & set(l2))
union = len(set(l1) | set(l2))

similarity = intersection / union

print(similarity)

result:

0.5714285714285714

5. Cosine Similarity in Python

In data science, cosine similarity helps to measure similarity between two non-zero arrays/vectors.

To calculate cosine similarity in Python we can:

from math import sqrt

l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]

dot_product = sum(i * j for i, j in zip(l1, l2))
magnitude1 = sqrt(sum(i ** 2 for i in l1))
magnitude2 = sqrt(sum(j ** 2 for j in l2))

similarity = dot_product / (magnitude1 * magnitude2)

print(similarity)

result:

0.9820303262342949

6. Euclidean Distance - compare lists in Python

In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points.

We can implement euclidean distance in Python to compare two lists:

import math

l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]

squared_distance = sum([(i - j) ** 2 for i, j in zip(l1, l2)])
distance = math.sqrt(squared_distance)

similarity = 1 / (1 + distance)

print(similarity)

result:

0.16952084719853724

7. Hamming distance in Python

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.

Examples:

  • "karolin" and "kathrin" - 3
  • "karolin" and "kerstin" - 3
  • "kathrin" and "kerstin" - 4
  • 0000 and 1111 - 4
  • 2173896 and 2233796 - 3

To compare two lists of strings or numbers and find similarity we can do:

list1 = [1, 2, 3, 4, 5]
list2 = [2, 4, 6, 8]

distance = sum(i != j for i, j in zip(list1, list2))

similarity = 1 - (distance / len(list1))

print(similarity)

Similarity of the both lists by comparing with Hamming distance is:

0.0

we can measure hamming distance also by using module - scipy - but arrays should have equal size:

from scipy.spatial import distance

l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5]

d = round(distance.hamming(l1, l2) * len(l1))
print(d)

result:

5

Summary

In this post, we saw how to compare two lists in Python and calculate the similarity between them.

We saw detailed examples for 7 different techniques to compute similarity. The examples show the basics for comparison of sets, arrays or lists. To learn more you can read the materials in Resources.

Resources