Compare Similarity Between Two Lists in Python
To measure how similar two lists are in Python, we can calculate:
- set intersection
- cosine similarity
- other measures, such as Euclidean or Hamming distance
The right measure also depends on the data type of the items, for example:
- integer
- float
- string
Let's cover several ways to compute the similarity between two Python lists or arrays.
We will answer the following questions:
- python get list difference
- similarity of two lists in python
- python - find similarity between two lists
- find similarity between two words in python
1. Calculate similarity with difflib
A quick way to compute the similarity of two numeric or string lists is difflib.SequenceMatcher(None, s1, s2), whose ratio() method returns the similarity score.
Similarity of numeric lists
import difflib
l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]
sm = difflib.SequenceMatcher(None, l1, l2)
print(sm.ratio())
which gives us about 0.18: only one element is matched out of the 11 elements in both lists.
How does it work?
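Let's test with:
l3 = [1, 3, 4, 5]
l4 = [2, 3, 4, 5]
sm = difflib.SequenceMatcher(None, l3, l4)
print(sm.ratio())
we get 0.75. SequenceMatcher does not simply compare the lists position by position: it finds the longest run of matching items and then repeats the search to the left and to the right of that run. Here the run [3, 4, 5] matches, so the ratio is 2 * 3 / 8, where 3 is the number of matched items and 8 is the total number of items in both lists.
More information is available in the docs (see the link in the Resources section), which describe ratio() as:
Return a measure of the sequences’ similarity as a float in the range [0, 1].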
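To make the ratio less of a black box, here is a minimal sketch that recomputes it by hand from get_matching_blocks(). The variable names matched and manual_ratio are only illustrative, and the formula 2.0 * M / T is the one given in the difflib documentation.

```python
import difflib

l3 = [1, 3, 4, 5]
l4 = [2, 3, 4, 5]

sm = difflib.SequenceMatcher(None, l3, l4)

# Each block is (start in l3, start in l4, size). The last block is a
# zero-length sentinel, so it adds nothing to the total.
blocks = sm.get_matching_blocks()
matched = sum(block.size for block in blocks)

# M = number of matched items, T = total number of items in both lists.
manual_ratio = 2.0 * matched / (len(l3) + len(l4))

print(matched)       # 3
print(manual_ratio)  # 0.75
print(sm.ratio())    # 0.75
```

Because the matching is based on runs of items, the order of the items matters, unlike the set-based measures.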
Similarity of two string lists
To find the similarity of lists of strings we can use the same method:
l5 = list("abcded")
l6 = list("acdefd")
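sm = difflib.SequenceMatcher(None, l5, l6)
print(sm.ratio())
So for the lists:
['a', 'b', 'c', 'd', 'e', 'd']
['a', 'c', 'd', 'e', 'f', 'd']
we get about 0.83, because 5 of the 6 characters in each list are matched (2 * 5 / 12).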
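Since a string is itself a sequence of characters, the same call can compare two words directly, without converting them to lists first. The helper name sequence_similarity below is hypothetical and only wraps the difflib call used above.

```python
import difflib


def sequence_similarity(a, b):
    """Return the SequenceMatcher ratio for two sequences (lists or strings)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Comparing the words directly...
word_score = sequence_similarity("abcded", "acdefd")

# ...gives the same score as comparing the lists of their characters.
list_score = sequence_similarity(list("abcded"), list("acdefd"))

print(round(word_score, 2))      # 0.83
print(word_score == list_score)  # True
```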
2. Set Intersection
We can find the similarity of elements of lists by checking the intersection of their elements. This is helpful when the location of the element doesn't matter:
l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]
intersection = set(l1) & set(l2)
similarity = len(intersection) / (len(l1) + len(l2) - len(intersection))
- first we find the intersection:
{1, 2, 4, 5}
- then we divide the size of the intersection by the number of distinct elements in both lists (the sum of the list lengths minus the size of the intersection, which is exact when the lists contain no duplicates); here that is 4 / 7, about 0.57
- we can also check which elements are present in only one of the lists:
set(l1) - intersection
- elements of l1 not present in l2: {3}
set(l2) - intersection
- elements of l2 not present in l1: {6, 8}
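The steps above can be gathered into one helper that reports the shared elements and the differences in both directions. This is a sketch only: the function name compare_as_sets and the dictionary keys are made up for illustration.

```python
def compare_as_sets(first, second):
    """Split the elements of two lists into shared and unique groups."""
    a, b = set(first), set(second)
    return {
        "shared": a & b,          # elements present in both lists
        "only_in_first": a - b,   # elements of first missing from second
        "only_in_second": b - a,  # elements of second missing from first
    }


l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]

result = compare_as_sets(l1, l2)

# sorted() gives a stable display order, since sets are unordered.
print(sorted(result["shared"]))          # [1, 2, 4, 5]
print(sorted(result["only_in_first"]))   # [3]
print(sorted(result["only_in_second"]))  # [6, 8]
```

Converting to sets drops duplicates and ordering, so this approach fits cases where only membership matters.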
3. Jaccard index in Python
The Jaccard index (Jaccard similarity coefficient) is a statistic used to gauge the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of the two sets.
We can use Jaccard index to calculate similarity of two lists in Python:
l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]
intersection = len(set(l1) & set(l2))
union = len(set(l1) | set(l2))
similarity = intersection / union
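The same calculation can be wrapped in a reusable function that also works for lists of strings. This is a minimal sketch: the name jaccard_similarity is hypothetical, and returning 1.0 for two empty lists is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(first, second):
    """Size of the intersection divided by the size of the union."""
    a, b = set(first), set(second)
    union = a | b
    if not union:
        # Assumption: two empty lists are treated as identical.
        return 1.0
    return len(a & b) / len(union)


numbers = jaccard_similarity([1, 3, 4, 5, 2], [2, 4, 6, 8, 5, 1])
words = jaccard_similarity(
    ["apple", "banana", "cherry"],
    ["banana", "cherry", "date"],
)

print(round(numbers, 2))  # 0.57  (4 shared elements out of 7 distinct)
print(words)              # 0.5   (2 shared words out of 4 distinct)
```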
5. Cosine Similarity in Python
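In data science, cosine similarity measures the similarity between two non-zero vectors as the cosine of the angle between them.
To calculate cosine similarity in plain Python we can use the code below. Note that zip() stops at the shorter list, so the extra element of l2 is left out of the dot product, which has the same effect as padding l1 with a trailing zero: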
from math import sqrt
l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]
dot_product = sum(i * j for i, j in zip(l1, l2))
magnitude1 = sqrt(sum(i ** 2 for i in l1))
magnitude2 = sqrt(sum(j ** 2 for j in l2))
similarity = dot_product / (magnitude1 * magnitude2)
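For vectors that are meant to have the same length, a stricter version can refuse mismatched or all-zero inputs instead of silently truncating them. This is a sketch under those assumptions; the function name cosine_similarity and the choice to raise ValueError are illustrative, not part of any library.

```python
from math import isclose, sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot_product = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot_product / (norm_u * norm_v)


# Vectors pointing in the same direction score (close to) 1.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
# Perpendicular vectors score 0.
perpendicular = cosine_similarity([1, 0], [0, 1])

print(isclose(same_direction, 1.0))  # True
print(perpendicular)                 # 0.0
```

Cosine similarity looks only at direction, so [1, 2, 3] and [2, 4, 6] count as fully similar even though their values differ.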
6. Euclidean Distance - compare lists in Python
In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points.
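We can implement Euclidean distance in Python to compare two lists. The distance is defined for points with the same number of coordinates, and zip() stops at the shorter list, so the last element of l2 is ignored here:
import math
l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5, 1]
# sum of the squared differences over the paired elements
squared_distance = sum((i - j) ** 2 for i, j in zip(l1, l2))
distance = math.sqrt(squared_distance)
# turn the distance into a score in (0, 1]; identical lists give 1
similarity = 1 / (1 + distance)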
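On Python 3.8 or newer, the standard library function math.dist computes the same distance for two points of equal dimension. The sketch below assumes that version; the wrapper name euclidean_similarity is hypothetical and reuses the 1 / (1 + distance) conversion from the example above.

```python
import math


def euclidean_similarity(u, v):
    """Map the Euclidean distance between two lists to a score in (0, 1]."""
    if len(u) != len(v):
        raise ValueError("lists must have the same length")
    return 1 / (1 + math.dist(u, v))


# The classic 3-4-5 triangle: the distance is 5, so the score is 1/6.
far_apart = euclidean_similarity([0, 0], [3, 4])
# Identical lists are at distance 0, so the score is 1.
identical = euclidean_similarity([1, 2, 3], [1, 2, 3])

print(math.dist([0, 0], [3, 4]))  # 5.0
print(identical)                  # 1.0
```

Unlike cosine similarity, this score is sensitive to magnitude, so scaling one list changes the result.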
7. Hamming distance in Python
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. For example:
- "karolin" and "kathrin" - 3
- "karolin" and "kerstin" - 3
- "kathrin" and "kerstin" - 4
- 0000 and 1111 - 4
- 2173896 and 2233796 - 3
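To compare two lists of strings or numbers and find their similarity we can do the following. Hamming distance is defined only for sequences of equal length, and zip() stops at the shorter list, so only the first four positions are compared here:
list1 = [1, 2, 3, 4, 5]
list2 = [2, 4, 6, 8]
# count the positions where the paired elements differ
distance = sum(i != j for i, j in zip(list1, list2))
similarity = 1 - (distance / len(list1))
All four compared positions differ, so the distance is 4 and the similarity of the two lists is about 0.2 (1 - 4/5).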
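We can also measure Hamming distance with the scipy module, but the arrays must have equal size. distance.hamming returns the fraction of positions that differ, so we multiply it by the length to get the number of differing positions: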
from scipy.spatial import distance
l1 = [1, 3, 4, 5, 2]
l2 = [2, 4, 6, 8, 5]
d = round(distance.hamming(l1, l2) * len(l1))
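The pure-Python approach can also be wrapped in helpers that reject inputs of different length rather than truncating them. This is a sketch: the names hamming_distance and hamming_similarity are hypothetical, and treating two empty sequences as fully similar is an assumption.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_similarity(a, b):
    """Share of positions that match, between 0 and 1."""
    if len(a) == 0:
        # Assumption: two empty sequences are treated as identical.
        return 1.0
    return 1 - hamming_distance(a, b) / len(a)


# Works for strings as well as lists, since both are sequences.
print(hamming_distance("karolin", "kathrin"))          # 3
print(hamming_similarity([1, 2, 3, 4], [1, 2, 0, 4]))  # 0.75
```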
In this post, we saw how to compare two lists in Python and calculate the similarity between them.
We saw detailed examples of several techniques to compute similarity: difflib, set intersection, the Jaccard index, cosine similarity, Euclidean distance and Hamming distance. The examples show the basics of comparing sets, arrays or lists. To learn more you can read the materials in Resources.