
How to Remove Duplicate Lines in Bash

• 2 min read

Quick Answer: Remove Duplicate Lines in Bash

To remove duplicate lines, use sort -u file.txt, which sorts and deduplicates in one command. To preserve the original order, use awk '!seen[$0]++' file.txt. The sort -u method is fastest; awk keeps the first occurrence of each line in place.

Quick Comparison: Deduplication Methods

Method      Syntax              Speed     Order Preserved  Best For
sort -u     sort -u file.txt    Fastest   No               General use
uniq        sort file | uniq    Fast      No               Consecutive dups
awk !seen   awk '!seen[$0]++'   Medium    Yes              Order matters
grep -E     Depends             Variable  Yes              Specific patterns

Bottom line: Use sort -u for simplicity and speed; use awk when order matters.
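A quick demonstration of the order difference, using made-up sample data:

```shell
#!/bin/bash
# Out-of-order input with duplicates
printf 'banana\napple\nbanana\ncherry\napple\n' > fruits.txt

sort -u fruits.txt            # sorted output: apple, banana, cherry
awk '!seen[$0]++' fruits.txt  # first-occurrence order: banana, apple, cherry
```

Both commands remove the duplicates; only awk keeps banana first.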


Remove duplicate lines from text files efficiently. Whether you’re cleaning up log files, deduplicating data exports, or removing redundant entries, understanding different deduplication methods is essential for data processing.

Method 1: Using sort and uniq

The classic pipeline sorts first, because uniq only collapses consecutive duplicate lines:

# Sort and remove duplicates
sort file.txt | uniq

# Remove duplicates (must be sorted)
sort file.txt | uniq > output.txt

# Remove duplicates in-place
sort file.txt | uniq > temp.txt && mv temp.txt file.txt

# Case-insensitive deduplication
sort -f file.txt | uniq -i

Example with sample file:

# Input (lines.txt):
apple
banana
apple
cherry
banana
date

# Command: sort lines.txt | uniq
# Output:
apple
banana
cherry
date

Method 2: Using sort -u Flag

The fastest method: sort -u combines sorting and deduplication in a single pass.

# Sort and remove duplicates in one command
sort -u file.txt

# Sort unique and save to file
sort -u input.txt > output.txt

# Numeric sort with deduplication
sort -n -u numbers.txt

# Reverse sort with deduplication
sort -r -u file.txt

Example:

# Input file
5
2
5
1
2
3

# Command: sort -u numbers.txt
# Output:
1
2
3
5

Method 3: Using awk (Preserves Order)

Unlike sort-based methods, awk preserves the original order of first occurrence.

# Remove duplicates preserving order
awk '!seen[$0]++' file.txt

# Remove duplicates and sort
awk '!seen[$0]++' file.txt | sort

# Case-insensitive deduplication
awk '!seen[tolower($0)]++' file.txt

# Count occurrences while removing (order preserved)
awk '{count[$0]++} !seen[$0]++ {order[++n]=$0} END {for (i=1; i<=n; i++) print count[order[i]], order[i]}' file.txt

Example:

# Input:
apple
banana
apple
cherry

# Command: awk '!seen[$0]++' file.txt
# Output (preserves order):
apple
banana
cherry

Method 4: Count Duplicates

Often you need to see how many times each line appears.

# Show count of each line
sort file.txt | uniq -c

# Show only lines that appear more than once
sort file.txt | uniq -c | awk '$1 > 1'

# Show only lines that appear exactly once
sort file.txt | uniq -u

# Show only duplicate lines
sort file.txt | uniq -d

Example:

# Input:
apple
banana
apple
cherry
apple
banana

# Show counts:
sort file.txt | uniq -c
# Output:
      3 apple
      2 banana
      1 cherry

# Show duplicates only:
sort file.txt | uniq -d
# Output:
apple
banana

# Show unique lines only:
sort file.txt | uniq -u
# Output:
cherry

Practical Examples

Example 1: Remove Duplicates from Log File

#!/bin/bash

log_file="$1"

if [ ! -f "$log_file" ]; then
  echo "Log file not found"
  exit 1
fi

output_file="${log_file%.log}_dedup.log"

# Preserve timestamp order while removing duplicate messages
awk '!seen[$0]++' "$log_file" > "$output_file"

echo "Deduplicated log: $output_file"

Input:

2024-12-25 10:15 Connection established
2024-12-25 10:16 Data transferred
2024-12-25 10:16 Data transferred
2024-12-25 10:17 Connection closed

Output:

2024-12-25 10:15 Connection established
2024-12-25 10:16 Data transferred
2024-12-25 10:17 Connection closed

Example 2: Remove Duplicates with Statistics

#!/bin/bash

input_file="$1"

# Get original line count
original_count=$(wc -l < "$input_file")

# Remove duplicates into a temp file (mktemp avoids clobbering other jobs)
tmp_file=$(mktemp)
sort -u "$input_file" > "$tmp_file"
dedup_count=$(wc -l < "$tmp_file")

duplicates=$((original_count - dedup_count))

echo "Original lines: $original_count"
echo "After dedup: $dedup_count"
echo "Duplicates removed: $duplicates"

mv "$tmp_file" "$input_file"

Output:

Original lines: 100
After dedup: 87
Duplicates removed: 13

Example 3: Remove Duplicates by Specific Field

#!/bin/bash

# Remove duplicates based on specific field (e.g., username)
csv_file="$1"
field="${2:-1}"

# Use awk to track seen values in the chosen field (-v avoids fragile shell interpolation)
awk -F',' -v f="$field" '!seen[$f]++' "$csv_file" > "${csv_file%.csv}_dedup.csv"

echo "Deduplicated by field $field"

Usage:

# Remove duplicate users by name (field 2)
bash script.sh users.csv 2

Example 4: Remove Case-Insensitive Duplicates

#!/bin/bash

file="$1"

# Remove duplicates ignoring case, keeping first occurrence
awk '!seen[tolower($0)]++ {print}' "$file" > "${file}_dedup"

echo "Case-insensitive deduplication complete"

Example:

# Input:
Hello
HELLO
hello
World

# Output:
Hello
World

Example 5: Function for Deduplication

#!/bin/bash

# Flexible deduplication function
deduplicate() {
  local input="$1"
  local output="${2:-${1}.dedup}"
  local method="${3:-sort-u}"

  if [ ! -f "$input" ]; then
    echo "Error: File not found"
    return 1
  fi

  case "$method" in
    sort-u)
      sort -u "$input" > "$output"
      ;;
    uniq)
      sort "$input" | uniq > "$output"
      ;;
    awk)
      awk '!seen[$0]++' "$input" > "$output"
      ;;
    *)
      echo "Unknown method: $method"
      return 1
      ;;
  esac

  echo "Deduplicated: $input -> $output"
}

# Usage
deduplicate "input.txt" "output.txt" "awk"

Example 6: Remove Duplicates with Preservation of Position

#!/bin/bash

# Remove duplicates while maintaining relative order
file="$1"

# Tag each unique line with its original line number, then strip the tag.
# awk already emits lines in first-occurrence order, so no re-sort is needed.
awk '!seen[$0]++ {print NR, $0}' "$file" | cut -d' ' -f2-

Example 7: Compare Files and Remove Matching Lines

#!/bin/bash

file1="$1"
file2="$2"

# Remove lines from file1 that appear in file2
awk 'NR==FNR {seen[$0]=1; next} !seen[$0]' "$file2" "$file1"

# This removes duplicates between two files
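If both files can be sorted, comm does the same subtraction: -23 suppresses lines unique to the second file and lines common to both. The filenames here are placeholders.

```shell
# Lines of file1.txt that do not appear in file2.txt (both sorted on the fly)
comm -23 <(sort file1.txt) <(sort file2.txt)
```

Unlike the awk version, this changes the output order to sorted order.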

Performance Comparison

For removing duplicates from large files:

Method       Speed      Memory  Preserves Order
sort -u      Fastest    Low     No
sort | uniq  Very Fast  Low     No
awk          Medium     Medium  Yes

Best choice: Use sort -u for speed, awk '!seen[$0]++' to preserve order.
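These characteristics can be checked on your own data with a rough benchmark; shuf -r (GNU coreutils) generates a test file with many repeats:

```shell
#!/bin/bash
# Build a ~100k-line test file drawn from only 1,000 distinct values
seq 1 1000 | shuf -r -n 100000 > big.txt

# Time each method (output discarded; we only care about wall-clock time)
time sort -u big.txt > /dev/null
time awk '!seen[$0]++' big.txt > /dev/null
```

Absolute timings vary by machine; the relative ranking is what matters.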

Important Considerations

Memory Usage

For very large files, streaming methods are better:

# Good for large files (doesn't load all at once)
sort input.txt | uniq > output.txt

# Using awk (loads into memory)
awk '!seen[$0]++' input.txt
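sort copes with files larger than RAM by spilling to temporary files. With GNU sort you can cap the in-memory buffer via -S and pick the temp directory via -T; the size and path below are illustrative:

```shell
# External merge sort: 512 MB in-memory buffer, temp files on a roomy disk
sort -u -S 512M -T /var/tmp huge.txt > huge_dedup.txt
```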

Preserving Order

If original line order matters:

# Preserves order (order of first occurrence)
awk '!seen[$0]++' file.txt

# Does NOT preserve order
sort -u file.txt

Case Sensitivity

By default, deduplication is case-sensitive:

# Case-sensitive (Hello and hello are different)
awk '!seen[$0]++' file.txt

# Case-insensitive
awk '!seen[tolower($0)]++' file.txt

Trailing Whitespace

Trailing whitespace affects matching:

# Remove whitespace before deduplication
sed 's/[[:space:]]*$//' file.txt | awk '!seen[$0]++'

# Or trim leading whitespace too
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt | awk '!seen[$0]++'

Key Points

  • Use sort -u for fastest deduplication (changes order)
  • Use awk '!seen[$0]++' to remove duplicates and preserve order
  • Use uniq -c to count occurrences
  • Use uniq -d to show only duplicates
  • Remember sort-based methods change line order
  • Watch out for whitespace differences
  • Consider case sensitivity requirements

Quick Reference

# Remove duplicates, sort order
sort -u file.txt

# Remove duplicates, preserve order
awk '!seen[$0]++' file.txt

# Count each line
sort file.txt | uniq -c

# Show only duplicates
sort file.txt | uniq -d

# Show only unique lines
sort file.txt | uniq -u

# Case-insensitive deduplication
awk '!seen[tolower($0)]++' file.txt
Putting It Together

A small script applying the methods above:

#!/bin/bash

file="$1"

# For quick deduplication (order doesn't matter):
sort -u "$file" > "${file%.txt}_dedup.txt"

# For preserving order:
awk '!seen[$0]++' "$file" > "${file%.txt}_dedup.txt"

# For analysis (show what was duplicated):
echo "Duplicates found:"
sort "$file" | uniq -d
echo "Removed: $(($(wc -l < "$file") - $(sort -u "$file" | wc -l))) lines"