How to Remove Duplicate Lines in Bash
Quick Answer: Remove Duplicate Lines in Bash
To remove duplicates, use sort -u file.txt which sorts and removes duplicates in one command. For preserving order, use awk '!seen[$0]++' file.txt. The sort -u method is fastest; awk preserves original order.
Quick Comparison: Deduplication Methods
| Method | Syntax | Speed | Order Preserved | Best For |
|---|---|---|---|---|
| sort -u | sort -u file.txt | Fastest | No | General use |
| uniq | sort file.txt \| uniq | Fast | No | Consecutive dups |
| awk !seen | awk '!seen[$0]++' | Medium | Yes | Order matters |
| uniq -c | sort file.txt \| uniq -c | Fast | No | Counting occurrences |
Bottom line: Use sort -u for simplicity and speed; use awk when order matters.
Removing duplicate lines from text files efficiently is a common task. Whether you're cleaning up log files, deduplicating data exports, or removing redundant entries, understanding the different deduplication methods is essential for data processing.
Method 1: Using sort and uniq
The uniq command removes adjacent duplicate lines, so pipe sorted input into it:
# Sort and remove duplicates
sort file.txt | uniq
# Save the deduplicated output to a file
sort file.txt | uniq > output.txt
# Remove duplicates in-place (via a temporary file)
sort file.txt | uniq > temp.txt && mv temp.txt file.txt
# Case-insensitive deduplication (-f folds case for sort, -i for uniq)
sort -f file.txt | uniq -i
Example with sample file:
# Input (lines.txt):
apple
banana
apple
cherry
banana
date
# Command: sort lines.txt | uniq
# Output:
apple
banana
cherry
date
Method 2: Using the sort -u Flag (Fastest)
The fastest method: sort -u combines sorting and deduplication in a single pass.
# Sort and remove duplicates in one command
sort -u file.txt
# Sort unique and save to file
sort -u input.txt > output.txt
# Numeric sort with deduplication
sort -n -u numbers.txt
# Reverse sort with deduplication
sort -r -u file.txt
Example:
# Input file
5
2
5
1
2
3
# Command: sort -u numbers.txt
# Output:
1
2
3
5
Method 3: Using awk (Preserves Order)
Unlike sort-based methods, awk preserves the original order of first occurrence.
# Remove duplicates preserving order
awk '!seen[$0]++' file.txt
# Remove duplicates and sort
awk '!seen[$0]++' file.txt | sort
# Case-insensitive deduplication
awk '!seen[tolower($0)]++' file.txt
# Count occurrences of every line (output order is unspecified)
awk '{count[$0]++} END {for (line in count) print count[line], line}' file.txt
Example:
# Input:
apple
banana
apple
cherry
# Command: awk '!seen[$0]++' file.txt
# Output (preserves order):
apple
banana
cherry
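The !seen[$0]++ idiom works because awk arrays default to 0: the first time a line appears, seen[$0] is 0, so !seen[$0]++ is true and the line prints (and the counter becomes 1); every later occurrence yields a non-zero count and is suppressed. A quick way to watch the counter in action:

```shell
# Print each line alongside the value seen[$0] had before incrementing;
# deduplication keeps exactly the lines where this value is 0
printf 'apple\nbanana\napple\n' | awk '{print seen[$0]++, $0}'
```

The first apple and banana print 0 (kept by the filter); the second apple prints 1 (dropped).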
Method 4: Count Duplicates
Often you need to see how many times each line appears.
# Show count of each line
sort file.txt | uniq -c
# Show only lines that appear more than once
sort file.txt | uniq -c | awk '$1 > 1'
# Show only lines that appear exactly once
sort file.txt | uniq -u
# Show only duplicate lines
sort file.txt | uniq -d
Example:
# Input:
apple
banana
apple
cherry
apple
banana
# Show counts:
sort file.txt | uniq -c
# Output:
3 apple
2 banana
1 cherry
# Show duplicates only:
sort file.txt | uniq -d
# Output:
apple
banana
# Show unique lines only:
sort file.txt | uniq -u
# Output:
cherry
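A common follow-on is ranking lines by how often they appear; this sketch pipes the counts from uniq -c into a reverse numeric sort:

```shell
# Most frequent lines first: count each line, then sort the
# counts in descending numeric order
sort file.txt | uniq -c | sort -rn
```

For the input above this puts 3 apple first and 1 cherry last.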
Practical Examples
Example 1: Remove Duplicates from Log File
#!/bin/bash
log_file="$1"
if [ ! -f "$log_file" ]; then
echo "Log file not found"
exit 1
fi
output_file="${log_file%.log}_dedup.log"
# Preserve timestamp order while removing duplicate messages
awk '!seen[$0]++' "$log_file" > "$output_file"
echo "Deduplicated log: $output_file"
Input:
2024-12-25 10:15 Connection established
2024-12-25 10:16 Data transferred
2024-12-25 10:16 Data transferred
2024-12-25 10:17 Connection closed
Output:
2024-12-25 10:15 Connection established
2024-12-25 10:16 Data transferred
2024-12-25 10:17 Connection closed
Example 2: Remove Duplicates with Statistics
#!/bin/bash
input_file="$1"
# Get original line count
original_count=$(wc -l < "$input_file")
# Remove duplicates into a temporary file (mktemp avoids clobbering
# a fixed /tmp path) and count
tmp=$(mktemp)
sort -u "$input_file" > "$tmp"
dedup_count=$(wc -l < "$tmp")
duplicates=$((original_count - dedup_count))
echo "Original lines: $original_count"
echo "After dedup: $dedup_count"
echo "Duplicates removed: $duplicates"
mv "$tmp" "$input_file"
Output:
Original lines: 100
After dedup: 87
Duplicates removed: 13
Example 3: Remove Duplicates by Specific Field
#!/bin/bash
# Remove duplicates based on specific field (e.g., username)
csv_file="$1"
field="${2:-1}"
# Pass the field number into awk safely with -v, then track seen values
awk -F',' -v f="$field" '!seen[$f]++' "$csv_file" > "${csv_file%.csv}_dedup.csv"
echo "Deduplicated by field $field"
Usage:
# Remove duplicate users by name (field 2)
bash script.sh users.csv 2
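When order doesn't matter, sort itself can deduplicate on a key field with no awk needed. A sketch assuming a comma-delimited file; note that which line survives for each key value is not guaranteed:

```shell
# -t, sets the delimiter; -k2,2 restricts both the sort key and the
# -u uniqueness comparison to field 2, keeping one line per value
sort -t, -k2,2 -u users.csv
```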
Example 4: Remove Case-Insensitive Duplicates
#!/bin/bash
file="$1"
# Remove duplicates ignoring case, keeping first occurrence
awk '!seen[tolower($0)]++ {print}' "$file" > "${file}_dedup"
echo "Case-insensitive deduplication complete"
Example:
# Input:
Hello
HELLO
hello
World
# Output:
Hello
World
Example 5: Function for Deduplication
#!/bin/bash
# Flexible deduplication function
deduplicate() {
local input="$1"
local output="${2:-${input}.dedup}"
local method="${3:-sort-u}"
if [ ! -f "$input" ]; then
echo "Error: File not found"
return 1
fi
case "$method" in
sort-u)
sort -u "$input" > "$output"
;;
uniq)
sort "$input" | uniq > "$output"
;;
awk)
awk '!seen[$0]++' "$input" > "$output"
;;
*)
echo "Unknown method: $method"
return 1
;;
esac
echo "Deduplicated: $input -> $output"
}
# Usage
deduplicate "input.txt" "output.txt" "awk"
Example 6: Remove Duplicates with Preservation of Position
#!/bin/bash
# Remove duplicates while maintaining relative order
file="$1"
# awk emits first occurrences already in original order; the NR
# prefix shows each kept line's original position
awk '!seen[$0]++ {print NR": "$0}' "$file"
Example 7: Compare Files and Remove Matching Lines
#!/bin/bash
file1="$1"
file2="$2"
# Remove lines from file1 that appear in file2
awk 'NR==FNR {seen[$0]=1; next} !seen[$0]' "$file2" "$file1"
# Output is file1 minus any lines also present in file2
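If both files are sorted (or can be), comm does the same job; a sketch using bash process substitution:

```shell
# -2 suppresses lines unique to file2, -3 suppresses lines common to
# both, leaving only the lines unique to file1 (requires sorted input)
comm -23 <(sort file1.txt) <(sort file2.txt)
```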
Performance Comparison
For removing duplicates from large files:
| Method | Speed | Memory | Preserves Order |
|---|---|---|---|
| sort -u | Fastest | Low | No |
| sort \| uniq | Very Fast | Low | No |
| awk | Medium | Medium | Yes |
Best choice: Use sort -u for speed, awk '!seen[$0]++' to preserve order.
Important Considerations
Memory Usage
For very large files, streaming methods are better:
# Good for large files: sort spills sorted runs to temporary files on disk
sort input.txt | uniq > output.txt
# awk holds every unique line in memory; fine until the unique
# set itself no longer fits in RAM
awk '!seen[$0]++' input.txt
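When the file is larger than RAM, GNU sort handles it by spilling sorted runs to disk and merging them, and both the memory buffer and the spill directory are tunable (these flags assume GNU coreutils):

```shell
# -S 512M caps the in-memory sort buffer; -T names the directory
# used for temporary spill files during the external merge
sort -u -S 512M -T /var/tmp big.txt > big_dedup.txt
```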
Preserving Order
If original line order matters:
# Preserves order (order of first occurrence)
awk '!seen[$0]++' file.txt
# Does NOT preserve order
sort -u file.txt
Case Sensitivity
By default, deduplication is case-sensitive:
# Case-sensitive (Hello and hello are different)
awk '!seen[$0]++' file.txt
# Case-insensitive
awk '!seen[tolower($0)]++' file.txt
Trailing Whitespace
Trailing whitespace affects matching:
# Remove whitespace before deduplication
sed 's/[[:space:]]*$//' file.txt | awk '!seen[$0]++'
# Or trim leading whitespace too
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt | awk '!seen[$0]++'
Key Points
- Use sort -u for fastest deduplication (changes order)
- Use awk '!seen[$0]++' to remove duplicates and preserve order
- Use uniq -c to count occurrences
- Use uniq -d to show only duplicates
- Remember sort-based methods change line order
- Watch out for whitespace differences
- Consider case sensitivity requirements
Quick Reference
# Remove duplicates, sort order
sort -u file.txt
# Remove duplicates, preserve order
awk '!seen[$0]++' file.txt
# Count each line
sort file.txt | uniq -c
# Show only duplicates
sort file.txt | uniq -d
# Show only unique lines
sort file.txt | uniq -u
# Case-insensitive deduplication
awk '!seen[tolower($0)]++' file.txt
Recommended Pattern
#!/bin/bash
file="$1"
# For quick deduplication (order doesn't matter):
sort -u "$file" > "${file%.txt}_dedup.txt"
# For preserving order:
awk '!seen[$0]++' "$file" > "${file%.txt}_dedup.txt"
# For analysis (show what was duplicated):
echo "Duplicates found:"
sort "$file" | uniq -d
echo "Removed: $(($(wc -l < "$file") - $(sort -u "$file" | wc -l))) lines"