
How to Find Duplicate Files


Quick Answer: Find Duplicate Files

To find duplicate files in Bash, use checksums to identify identical content: md5sum * | sort | uniq -w 32 -d prints one line per group of identical files in the current directory. For faster preliminary filtering, check file sizes first: find . -type f -printf '%s %p\n' | sort -n.

Quick Comparison: Duplicate File Detection Methods

Method        Speed      Accuracy          Best For
------        -----      --------          --------
Size check    Very fast  False positives   Preliminary
md5sum        Fast       Perfect           Exact matches
sha256sum     Slower     Perfect           Security check
cmp           Medium     Perfect           File-by-file
fdupes tool   Fast       Perfect           Bulk finding

Bottom line: Use size check first, then md5sum to confirm duplicates.
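A pipeline combining both steps of that advice — size pre-filter first, checksums only on the collisions — might look like this (a sketch assuming GNU find, xargs, and coreutils):

```shell
#!/bin/bash
# Two-stage scan: hash only files whose size matches another file's.
dir="${1:-.}"

# Stage 1: sizes shared by two or more files
dup_sizes=$(find "$dir" -type f -printf '%s\n' | sort -n | uniq -d)

# Stage 2: checksum only the size-colliding files; group identical ones
for size in $dup_sizes; do
  find "$dir" -type f -size "${size}c"
done | xargs -r -d '\n' md5sum | sort | uniq -w 32 --all-repeated=separate
```

On large trees this avoids hashing any file whose size is unique, which is usually most of them.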


Find duplicate files using checksums, size, or file comparison.

Find by Size First

# Sizes shared by two or more files (potential duplicates)
find . -type f -printf '%s\n' | sort -n | uniq -d

Using md5sum

# Generate checksums (batched: one md5sum call per group of files)
find . -type f -exec md5sum {} + | sort

# Show every member of each duplicate group (GNU uniq)
find . -type f -exec md5sum {} + | sort | uniq -w 32 --all-repeated=separate
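The comparison table lists sha256sum for security-sensitive checks; the same grouping works, but SHA-256 digests are 64 hex characters, so uniq's comparison window widens accordingly:

```shell
# Group duplicates by SHA-256 (64-character digests, hence -w 64)
find . -type f -exec sha256sum {} + | sort | uniq -w 64 --all-repeated=separate
```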

Practical Example: Find Duplicates

#!/bin/bash

# File: find_duplicates.sh

directory="${1:-.}"

echo "=== Finding Duplicate Files ==="
echo ""

# Create temp file with checksums
tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT

find "$directory" -type f -exec md5sum {} + | sort > "$tmpfile"

# Report duplicate groups (same checksum appears 2+ times)
awk '
{
  hash = $1
  file = $0
  sub(/^[^ ]+ +/, "", file)   # strip the checksum, keep the full path
  if (hash in first) {
    if (!(hash in printed)) {
      print "Duplicate checksum:", hash
      print "  " first[hash]
      printed[hash] = 1
    }
    print "  " file
  } else {
    first[hash] = file
  }
}' "$tmpfile"

Find by Filename

# Duplicate basenames (same name, possibly in different directories)
find . -type f -name "*.txt" -printf '%f\n' | sort | uniq -d
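To see where the same-named files actually live, one approach is to carry the full path alongside the basename (a sketch; it assumes no path contains a tab character):

```shell
# Print full paths grouped by duplicated basename
find . -type f -printf '%f\t%p\n' | sort | awk -F'\t' '
  $1 == prev { if (prev_line != "") { print prev_line; prev_line = "" } print $2; next }
  { prev = $1; prev_line = $2 }
'
```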

Compare Files Directly

#!/bin/bash

# Count identical files per checksum
find . -type f -print0 | xargs -0 md5sum | \
  sort | \
  awk '{print $1}' | \
  uniq -c -d | \
  while read -r count hash; do
    echo "Found $count identical files (hash: $hash)"
  done
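Checksum collisions are vanishingly unlikely between accidental duplicates, but the cmp row in the table above is the byte-for-byte guarantee. A small hypothetical helper, confirm_duplicate, sketches the check; cmp -s suppresses output and reports identity via its exit status:

```shell
#!/bin/bash
# Verify two suspected duplicates byte by byte before acting on them.
confirm_duplicate() {
  if cmp -s "$1" "$2"; then
    echo "Identical: $1 == $2"
  else
    echo "Differ: $1 != $2"
  fi
}
```

Example: confirm_duplicate photo.jpg backup/photo.jpg.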

Remove Duplicates

#!/bin/bash

# Keep the first copy found, remove later duplicates
# WARNING: destructive -- review with a dry run (echo instead of rm) first
declare -A seen
find . -type f -print0 | while IFS= read -r -d '' file; do
  hash=$(md5sum "$file" | cut -d' ' -f1)
  if [ -n "${seen[$hash]}" ]; then
    echo "Removing duplicate: $file"
    rm "$file"
  else
    seen[$hash]=1
  fi
done
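Deleting is not the only option: if both names should keep working, each duplicate can be replaced by a hard link to the first copy, so the content is stored once. A sketch using a hypothetical link_duplicates helper (requires bash 4+ for associative arrays, and all files on one filesystem, since hard links cannot cross filesystems):

```shell
#!/bin/bash
# Replace duplicate files with hard links to the first copy found.
link_duplicates() {
  declare -A first
  find "${1:-.}" -type f -print0 | while IFS= read -r -d '' file; do
    hash=$(md5sum "$file" | cut -d' ' -f1)
    if [ -n "${first[$hash]}" ]; then
      echo "Linking $file -> ${first[$hash]}"
      ln -f "${first[$hash]}" "$file"   # same content, one inode
    else
      first[$hash]=$file
    fi
  done
}
```

After linking, both paths remain readable but the disk space is counted only once.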

Summary

Pre-filter by size for speed, use md5sum (or sha256sum) for reliable duplicate detection, and always verify matches before deleting anything.