This brief article will show you how to find duplicate lines in files on Linux Mint. There are several different ways to find and count duplicate lines, and we will cover the most popular ones.
Suppose we have a text file - test.txt - with the following content:
abc
abc
def
def
cvf
ref
tex
tdx
abc
Option 1: Find Duplicate Lines and Count Them
In the terminal, chaining two commands - sort and uniq - will identify the duplicated lines and show how many times each line appears:
sort test.txt | uniq -c
The output will be:
3 abc
1 cvf
2 def
1 ref
1 tdx
1 tex
To list only the duplicated lines you can add a parameter to uniq or filter with the grep command:
sort test.txt | uniq -cd
or
sort test.txt | uniq -c | grep -v '^ *1 '
Both commands will show only the lines that occur more than once:
3 abc
2 def
Option 2: Find Duplicate Lines in a CSV File with the Terminal
Let's say that you need to find the duplicated lines in a CSV file based only on some columns. If you want to check the whole line and there is no row number column, you can use the previous option.
Let's use the following CSV file - test.csv:
a,1,3
b,2,3
c,3,4
d,5,5
b,2,4
c,f,4
a,2,3
If we want to find duplicates based on columns 1 and 3, we can use a command like:
awk -F, 'NR==FNR {A[$1,$3]++; next} A[$1,$3]>1 && !B[$0]++' test.csv test.csv
This will result in (note that the file has to be passed twice: the first pass counts the combinations of columns 1 and 3, and the second pass prints every row whose combination appears more than once):
a,1,3
c,3,4
c,f,4
a,2,3
Option 3: Use Pandas to Find Duplicate Lines
Finally, let's see how to find duplicates with a powerful tool called Pandas. It can help you find duplicates in many different file types and in many different ways.
You only need to install Pandas, since Python is installed by default on Linux Mint. To install Pandas on Linux Mint, use the following command:
pip install pandas
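If you want to confirm that Pandas was installed correctly, you can print its version from Python (a quick sanity check, assuming the pip install above succeeded):
import pandas as pd
print(pd.__version__)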
Next, read the file in which you need to find the duplicates:
import pandas as pd
df = pd.read_csv("test.csv")
or JSON files:
df = pd.read_json("test.json")
To find and list all duplicated rows, use:
df[df.duplicated(keep=False)]
or for a specific column use:
df[df.duplicated(['ID'], keep=False)]
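As a complete example, here is a minimal sketch that reproduces the column-based check from Option 2 on our test.csv. Since that file has no header row, header=None is passed and the columns are labelled 0, 1 and 2 (the column name 'ID' above is just a placeholder for a file with headers):
import pandas as pd

# test.csv has no header row, so read it with header=None;
# the columns are then labelled 0, 1 and 2
df = pd.read_csv("test.csv", header=None)

# keep=False marks every row that shares its values in columns 0 and 2
# with at least one other row - the same check as the awk example above
duplicates = df[df.duplicated(subset=[0, 2], keep=False)]
print(duplicates)
This prints the same four rows as the awk command in Option 2.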
You can find a really nice and comprehensive video on the topic here: Python Pandas find and drop duplicate data