How to Remove Duplicate Lines in Bash
Quick Answer: Remove Duplicate Lines in Bash
To remove duplicates, use sort -u file.txt which sorts and removes duplicates in one command. For preserving order, use awk '!seen[$0]++' file.txt. The sort -u method is fastest; awk preserves original order.
Quick Comparison: Deduplication Methods
| Method | Syntax | Speed | Order Preserved | Best For |
|---|---|---|---|---|
| sort -u | sort -u file.txt | Fastest | No | General use |
| uniq | sort file.txt \| uniq | Fast | No | Consecutive dups |
| awk !seen | awk '!seen[$0]++' | Medium | Yes | Order matters |
| uniq -c | sort file.txt \| uniq -c | Fast | No | Counting occurrences |
Bottom line: Use sort -u for simplicity and speed; use awk when order matters.
Removing duplicate lines from text files efficiently is a common task. Whether you're cleaning up log files, deduplicating data exports, or removing redundant entries, understanding the different deduplication methods is essential for data processing.
Method 1: Using sort and uniq
The uniq command removes adjacent duplicate lines, so pipe sorted input into it:
# Sort and remove duplicates
sort file.txt | uniq
# Save the deduplicated output to a file
sort file.txt | uniq > output.txt
# Remove duplicates in-place (via a temporary file)
sort file.txt | uniq > temp.txt && mv temp.txt file.txt
# Case-insensitive deduplication (-f folds case for sort, -i for uniq)
sort -f file.txt | uniq -i
Example with sample file:
# Input (lines.txt):
apple
banana
apple
cherry
banana
date
# Command: sort lines.txt | uniq
# Output:
apple
banana
cherry
date
Method 2: Using the sort -u Flag (Fastest)
The fastest method: sort -u combines sorting and deduplication in a single pass.
# Sort and remove duplicates in one command
sort -u file.txt
# Sort unique and save to file
sort -u input.txt > output.txt
# Numeric sort with deduplication
sort -n -u numbers.txt
# Reverse sort with deduplication
sort -r -u file.txt
Example:
# Input file
5
2
5
1
2
3
# Command: sort -u numbers.txt
# Output:
1
2
3
5
Method 3: Using awk (Preserves Order)
Unlike sort-based methods, awk preserves the original order of first occurrence.
# Remove duplicates preserving order
awk '!seen[$0]++' file.txt
# Remove duplicates and sort
awk '!seen[$0]++' file.txt | sort
# Case-insensitive deduplication
awk '!seen[tolower($0)]++' file.txt
# Count occurrences of every line (output order is unspecified)
awk '{count[$0]++} END {for (line in count) print count[line], line}' file.txt
Example:
# Input:
apple
banana
apple
cherry
# Command: awk '!seen[$0]++' file.txt
# Output (preserves order):
apple
banana
cherry
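The !seen[$0]++ idiom works because awk arrays default to 0: the first time a line appears, seen[$0] is 0, so !seen[$0]++ is true and the line prints (and the counter becomes 1); every later occurrence yields a non-zero count and is suppressed. A quick way to watch the counter in action:

```shell
# Print each line alongside the value seen[$0] had before incrementing;
# deduplication keeps exactly the lines where this value is 0
printf 'apple\nbanana\napple\n' | awk '{print seen[$0]++, $0}'
```

The first apple and banana print 0 (kept by the filter); the second apple prints 1 (dropped).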
Method 4: Count Duplicates
Often you need to see how many times each line appears.
# Show count of each line
sort file.txt | uniq -c
# Show only lines that appear more than once
sort file.txt | uniq -c | awk '$1 > 1'
# Show only lines that appear exactly once
sort file.txt | uniq -u
# Show only duplicate lines
sort file.txt | uniq -d
Example:
# Input:
apple
banana
apple
cherry
apple
banana
# Show counts:
sort file.txt | uniq -c
# Output:
3 apple
2 banana
1 cherry
# Show duplicates only:
sort file.txt | uniq -d
# Output:
apple
banana
# Show unique lines only:
sort file.txt | uniq -u
# Output:
cherry
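A common follow-on is ranking lines by how often they appear; this sketch pipes the counts from uniq -c into a reverse numeric sort:

```shell
# Most frequent lines first: count each line, then sort the
# counts in descending numeric order
sort file.txt | uniq -c | sort -rn
```

For the input above this puts 3 apple first and 1 cherry last.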
Practical Examples
Example 1: Remove Duplicates from Log File
#!/bin/bash
log_file="$1"
if [ ! -f "$log_file" ]; then
echo "Log file not found"
exit 1
fi
output_file="${log_file%.log}_dedup.log"
# Preserve timestamp order while removing duplicate messages
awk '!seen[$0]++' "$log_file" > "$output_file"
echo "Deduplicated log: $output_file"
Input:
2024-12-25 10:15 Connection established
2024-12-25 10:16 Data transferred
2024-12-25 10:16 Data transferred
2024-12-25 10:17 Connection closed
Output:
2024-12-25 10:15 Connection established
2024-12-25 10:16 Data transferred
2024-12-25 10:17 Connection closed
Example 2: Remove Duplicates with Statistics
#!/bin/bash
input_file="$1"
# Get original line count
original_count=$(wc -l < "$input_file")
# Remove duplicates into a temporary file (mktemp avoids clobbering
# a fixed /tmp path) and count
tmp=$(mktemp)
sort -u "$input_file" > "$tmp"
dedup_count=$(wc -l < "$tmp")
duplicates=$((original_count - dedup_count))
echo "Original lines: $original_count"
echo "After dedup: $dedup_count"
echo "Duplicates removed: $duplicates"
mv "$tmp" "$input_file"
Output:
Original lines: 100
After dedup: 87
Duplicates removed: 13
Example 3: Remove Duplicates by Specific Field
#!/bin/bash
# Remove duplicates based on specific field (e.g., username)
csv_file="$1"
field="${2:-1}"
# Pass the field number into awk safely with -v, then track seen values
awk -F',' -v f="$field" '!seen[$f]++' "$csv_file" > "${csv_file%.csv}_dedup.csv"
echo "Deduplicated by field $field"
Usage:
# Remove duplicate users by name (field 2)
bash script.sh users.csv 2
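When order doesn't matter, sort itself can deduplicate on a key field with no awk needed. A sketch assuming a comma-delimited file; note that which line survives for each key value is not guaranteed:

```shell
# -t, sets the delimiter; -k2,2 restricts both the sort key and the
# -u uniqueness comparison to field 2, keeping one line per value
sort -t, -k2,2 -u users.csv
```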
Example 4: Remove Case-Insensitive Duplicates
#!/bin/bash
file="$1"
# Remove duplicates ignoring case, keeping first occurrence
awk '!seen[tolower($0)]++ {print}' "$file" > "${file}_dedup"
echo "Case-insensitive deduplication complete"
Example:
# Input:
Hello
HELLO
hello
World
# Output:
Hello
World
Example 5: Function for Deduplication
#!/bin/bash
# Flexible deduplication function
deduplicate() {
local input="$1"
local output="${2:-${input}.dedup}"
local method="${3:-sort-u}"
if [ ! -f "$input" ]; then
echo "Error: File not found"
return 1
fi
case "$method" in
sort-u)
sort -u "$input" > "$output"
;;
uniq)
sort "$input" | uniq > "$output"
;;
awk)
awk '!seen[$0]++' "$input" > "$output"
;;
*)
echo "Unknown method: $method"
return 1
;;
esac
echo "Deduplicated: $input -> $output"
}
# Usage
deduplicate "input.txt" "output.txt" "awk"
Example 6: Remove Duplicates with Preservation of Position
#!/bin/bash
# Remove duplicates while maintaining relative order
file="$1"
# awk emits first occurrences already in original order; the NR
# prefix shows each kept line's original position
awk '!seen[$0]++ {print NR": "$0}' "$file"
Example 7: Compare Files and Remove Matching Lines
#!/bin/bash
file1="$1"
file2="$2"
# Remove lines from file1 that appear in file2
awk 'NR==FNR {seen[$0]=1; next} !seen[$0]' "$file2" "$file1"
# Output is file1 minus any lines also present in file2
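If both files are sorted (or can be), comm does the same job; a sketch using bash process substitution:

```shell
# -2 suppresses lines unique to file2, -3 suppresses lines common to
# both, leaving only the lines unique to file1 (requires sorted input)
comm -23 <(sort file1.txt) <(sort file2.txt)
```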
Performance Comparison
For removing duplicates from large files:
| Method | Speed | Memory | Preserves Order |
|---|---|---|---|
| sort -u | Fastest | Low | No |
| sort \| uniq | Very Fast | Low | No |
| awk | Medium | Medium | Yes |
Best choice: Use sort -u for speed, awk '!seen[$0]++' to preserve order.
Important Considerations
Memory Usage
For very large files, streaming methods are better:
# Good for large files: sort spills sorted runs to temporary files on disk
sort input.txt | uniq > output.txt
# awk holds every unique line in memory; fine until the unique
# set itself no longer fits in RAM
awk '!seen[$0]++' input.txt
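When the file is larger than RAM, GNU sort handles it by spilling sorted runs to disk and merging them, and both the memory buffer and the spill directory are tunable (these flags assume GNU coreutils):

```shell
# -S 512M caps the in-memory sort buffer; -T names the directory
# used for temporary spill files during the external merge
sort -u -S 512M -T /var/tmp big.txt > big_dedup.txt
```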
Preserving Order
If original line order matters:
# Preserves order (order of first occurrence)
awk '!seen[$0]++' file.txt
# Does NOT preserve order
sort -u file.txt
Case Sensitivity
By default, deduplication is case-sensitive:
# Case-sensitive (Hello and hello are different)
awk '!seen[$0]++' file.txt
# Case-insensitive
awk '!seen[tolower($0)]++' file.txt
Trailing Whitespace
Trailing whitespace affects matching:
# Remove whitespace before deduplication
sed 's/[[:space:]]*$//' file.txt | awk '!seen[$0]++'
# Or trim leading whitespace too
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt | awk '!seen[$0]++'
Key Points
- Use sort -u for fastest deduplication (changes order)
- Use awk '!seen[$0]++' to remove duplicates and preserve order
- Use uniq -c to count occurrences
- Use uniq -d to show only duplicates
- Remember sort-based methods change line order
- Watch out for whitespace differences
- Consider case sensitivity requirements
Quick Reference
# Remove duplicates, sort order
sort -u file.txt
# Remove duplicates, preserve order
awk '!seen[$0]++' file.txt
# Count each line
sort file.txt | uniq -c
# Show only duplicates
sort file.txt | uniq -d
# Show only unique lines
sort file.txt | uniq -u
# Case-insensitive deduplication
awk '!seen[tolower($0)]++' file.txt
Recommended Pattern
#!/bin/bash
file="$1"
# For quick deduplication (order doesn't matter):
sort -u "$file" > "${file%.txt}_dedup.txt"
# For preserving order:
awk '!seen[$0]++' "$file" > "${file%.txt}_dedup.txt"
# For analysis (show what was duplicated):
echo "Duplicates found:"
sort "$file" | uniq -d
echo "Removed: $(($(wc -l < "$file") - $(sort -u "$file" | wc -l))) lines"