How to Find Duplicate Lines in Files on Linux Mint

This brief article will show you how to find duplicate lines in files in Linux Mint. There are several different ways to find and count duplicate lines and we will cover the most popular ones.

Suppose we have a text or CSV file, test.txt, with the following content:

abc
abc
def
def
cvf
ref
tex
tdx
abc

Option 1: Find Duplicated Lines and Count them

Chaining two commands in the terminal, sort and uniq, identifies the duplicated lines and shows how many times each line occurs. The file is sorted first because uniq only compares adjacent lines:

sort test.txt | uniq -c

The output will be:

  3 abc
  1 cvf
  2 def
  1 ref
  1 tdx
  1 tex

To list only the duplicated lines, add the -d option to uniq or filter the counts with grep:

sort test.txt | uniq -cd

or

sort test.txt | uniq -c | grep -v '^ *1 '

which will show:

  3 abc
  2 def
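
For readers who prefer a script, the same counting can be sketched in Python with collections.Counter. This is only an illustration, not part of the commands above: the function names count_lines and duplicated_lines are hypothetical, and the content of test.txt is inlined so the example runs on its own.

```python
from collections import Counter

# Same content as test.txt above, inlined so the sketch is self-contained.
SAMPLE = """abc
abc
def
def
cvf
ref
tex
tdx
abc
"""


def count_lines(text):
    """Count how many times each line occurs (similar to `sort | uniq -c`)."""
    return Counter(text.splitlines())


def duplicated_lines(text):
    """Keep only lines that occur more than once (similar to `uniq -cd`)."""
    return {line: n for line, n in count_lines(text).items() if n > 1}


if __name__ == "__main__":
    # Print in sorted order to mirror the order produced by `sort`.
    for line, n in sorted(duplicated_lines(SAMPLE).items()):
        print(f"{n:7d} {line}")
```

To run it against a real file, the inlined SAMPLE could be replaced by the text read from test.txt.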

Option 2: Find Duplicate Lines in CSV file with terminal

Suppose you need to find lines in a CSV file that are duplicated only in certain columns. If you want to compare whole lines, and the file has no row-number column, the previous option is enough.

Take the following CSV file, test.csv:

a,1,3
b,2,3
c,3,4
d,5,5
b,2,4
c,f,4
a,2,3

To find duplicates based on columns 1 and 3, we can use a command like:

awk -F, 'NR==FNR {A[$1,$3]++; next} A[$1,$3]>1 && !B[$0]++' test.csv test.csv

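This will result in the following output (the file name is given twice because awk reads the file once to count the column 1 and 3 combinations and a second time to print the matching lines):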

a,1,3
c,3,4
c,f,4
a,2,3
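
The two-pass logic of the awk command can also be sketched in Python. This is a minimal illustration under stated assumptions: the function name duplicates_by_columns and its key_columns parameter are hypothetical, the column indexes are zero-based (so 0 and 2 correspond to awk's $1 and $3), and the content of test.csv is inlined.

```python
import csv
import io
from collections import Counter

# Same content as test.csv above, inlined so the sketch is self-contained.
CSV_TEXT = """a,1,3
b,2,3
c,3,4
d,5,5
b,2,4
c,f,4
a,2,3
"""


def duplicates_by_columns(text, key_columns=(0, 2)):
    """Return rows whose values in key_columns occur in more than one row."""
    rows = list(csv.reader(io.StringIO(text)))

    # First pass: count each key (the role of A[$1,$3]++ in the awk command).
    counts = Counter(tuple(row[i] for i in key_columns) for row in rows)

    # Second pass: keep rows with a repeated key, skipping fully identical
    # rows that were already emitted (the role of !B[$0]++).
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        whole = tuple(row)
        if counts[key] > 1 and whole not in seen:
            seen.add(whole)
            result.append(row)
    return result


if __name__ == "__main__":
    for row in duplicates_by_columns(CSV_TEXT):
        print(",".join(row))
```

Passing a different tuple, for example key_columns=(0,), would compare rows on the first column only.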

Option 3: Use Pandas to find duplicate lines

Finally we can see how to find duplicates with a powerful tool called Pandas. It will help you to find duplicates in many different file types and in many different ways.

Python is installed by default on Linux Mint, so you only need to install Pandas. To install it, use the following command:

pip install pandas

Next, read the file in which you want to find duplicates:

import pandas as pd
# test.csv above has no header row; header=None keeps its first line as data
df = pd.read_csv("test.csv", header=None)

or JSON files:

df = pd.read_json("test.json")

To find and list all duplicated lines use:

df[df.duplicated(keep=False)]

or, to compare only specific columns, pass their names (here a column named ID):

df[df.duplicated(['ID'], keep=False)]
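
As a worked example, the sketch below applies duplicated to the test.csv data from Option 2. It assumes the file has no header row, so it is read with header=None and the columns are labeled 0, 1 and 2; the data is inlined through io.StringIO only to keep the example self-contained.

```python
import io

import pandas as pd

# Same content as test.csv from Option 2, inlined for a self-contained sketch.
CSV_TEXT = """a,1,3
b,2,3
c,3,4
d,5,5
b,2,4
c,f,4
a,2,3
"""

# header=None: the sample has no header row, so columns are labeled 0, 1, 2.
df = pd.read_csv(io.StringIO(CSV_TEXT), header=None)

# Rows whose values in the first and third columns occur more than once.
# keep=False marks every member of a duplicate group, not only the repeats.
dupes = df[df.duplicated(subset=[0, 2], keep=False)]

print(dupes)
```

With this data the selected rows are the same four lines that the awk command in Option 2 prints.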

You can find a comprehensive video on the topic here: Python Pandas find and drop duplicate data
