
Use `comm` to compare sorted files with custom delimiters

Author
  • User
    Linux Bash

Understanding the Use of comm Command With Custom Delimiters in Linux Bash

The comm command compares two sorted files line by line, which makes it a handy tool for administrators and developers who work with text data. Most tutorials cover only its default usage on newline-separated input. Here we look at data that uses a custom delimiter, and at how a little pre-processing lets comm handle it.

Q&A: Using comm with Custom Delimiters

Q1: What is the comm command used for?

A1: The comm command compares two sorted files. By default it prints three tab-separated columns: lines unique to file1, lines unique to file2, and lines common to both. The options -1, -2, and -3 suppress the corresponding column.
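
As a small illustration of the three columns, the sketch below builds two throwaway sorted files and uses the -1, -2, and -3 options to keep one column at a time. The file contents and variable names are made up for this example.

```shell
# Work in a throwaway directory so no existing files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana mango  > "$tmpdir/file1.txt"
printf '%s\n' apple banana cherry > "$tmpdir/file2.txt"

# Each of -1, -2 and -3 suppresses the matching column.
only_in_file1=$(comm -23 "$tmpdir/file1.txt" "$tmpdir/file2.txt")  # keep column 1
only_in_file2=$(comm -13 "$tmpdir/file1.txt" "$tmpdir/file2.txt")  # keep column 2
in_both=$(comm -12 "$tmpdir/file1.txt" "$tmpdir/file2.txt")        # keep column 3

printf 'only in file1: %s\n' "$only_in_file1"
printf 'only in file2: %s\n' "$only_in_file2"
printf 'in both:\n%s\n' "$in_both"

rm -rf "$tmpdir"
```

Keeping a single column is usually the easiest form to feed into another command, because the output then carries no leading tabs.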

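Q2: How does comm handle file comparison by default?

A2: comm expects both files to be sorted in the same collation order, that is, the order sort produces in the current locale. If an input is not sorted, the output is unreliable, and GNU comm also warns that the file is not in sorted order.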
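
One way to guard against unsorted input is to check it with sort -c and to sort both files under the same locale before calling comm. The following is a minimal sketch with made-up file contents; pinning LC_ALL=C is an assumption made here for reproducibility, not a requirement.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' mango apple banana  > "$tmpdir/file1.txt"   # deliberately unsorted
printf '%s\n' banana cherry apple > "$tmpdir/file2.txt"   # deliberately unsorted

# Use one locale for both sort and comm so they agree on ordering.
export LC_ALL=C

# sort -c exits non-zero when the input is not already sorted.
if sort -c "$tmpdir/file1.txt" 2>/dev/null; then
  was_sorted=yes
else
  was_sorted=no
fi

sort "$tmpdir/file1.txt" > "$tmpdir/file1.sorted"
sort "$tmpdir/file2.txt" > "$tmpdir/file2.sorted"

common=$(comm -12 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
printf 'file1 sorted on input: %s\n' "$was_sorted"
printf 'common lines:\n%s\n' "$common"

rm -rf "$tmpdir"
```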

Q3: Can comm handle files with custom delimiters, such as commas or tabs?

A3: Not directly. comm works line by line and expects records to end with a newline. GNU comm does offer --output-delimiter, but that only changes the separator between its output columns, not how the input is split. With some pre-processing using tools like tr or awk, you can temporarily turn the delimiter into newlines so that comm can work on the data.
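
To make the distinction between input and output delimiters concrete, here is a short sketch that assumes GNU coreutils comm, whose --output-delimiter option replaces the tab between output columns. The input files are still plain newline-separated lists, and their contents are invented for the example.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' apple banana mango  > "$tmpdir/file1.txt"
printf '%s\n' apple banana cherry > "$tmpdir/file2.txt"

# GNU comm only: use a comma instead of a tab between the output columns.
result=$(comm --output-delimiter=, "$tmpdir/file1.txt" "$tmpdir/file2.txt")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Lines from the third column are prefixed with two delimiters and lines from the second column with one, exactly as they would be with tabs.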

Q4: What's an example of comparing files that use a custom delimiter like commas?

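A4: It depends on how the data is laid out. If each file is a single comma-separated list of items, convert the commas to newlines, sort, run comm, and convert back if needed. If the files are ordinary CSV with one record per line, comm can already compare whole rows once both files are sorted; to compare on the first column only, extract that column (for example with cut) and sort it before calling comm.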
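
For CSV files with one record per line, a possible approach is sketched below: compare whole rows directly, or pull out the first column with cut and compare only the keys. The file names a.csv and b.csv and their contents are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'alice,30' 'bob,25' 'carol,41' > "$tmpdir/a.csv"
printf '%s\n' 'bob,26' 'carol,41' 'dave,19'  > "$tmpdir/b.csv"

# Whole-row comparison: a row is common only if every field matches.
sort "$tmpdir/a.csv" > "$tmpdir/a.rows"
sort "$tmpdir/b.csv" > "$tmpdir/b.rows"
shared_rows=$(comm -12 "$tmpdir/a.rows" "$tmpdir/b.rows")

# Key comparison: extract the first comma-separated field, then sort it.
cut -d, -f1 "$tmpdir/a.csv" | sort > "$tmpdir/a.keys"
cut -d, -f1 "$tmpdir/b.csv" | sort > "$tmpdir/b.keys"
shared_keys=$(comm -12 "$tmpdir/a.keys" "$tmpdir/b.keys")

printf 'rows in both files:\n%s\n' "$shared_rows"
printf 'keys in both files:\n%s\n' "$shared_keys"

rm -rf "$tmpdir"
```

Note that cut splits on every comma, so this sketch assumes the first field contains no quoted commas.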

Background and Usage

comm is straightforward but underappreciated. For two files, file1.txt and file2.txt, each containing sorted names with one name per line, the call looks like this:

comm file1.txt file2.txt

This command outputs three tab-separated columns as described previously.

However, suppose the items in our files are separated not by newlines but by another delimiter, such as semicolons or tabs. We need to transform these files before comm can work on them.

Let's take an example with semicolon-delimited files:

# file1.txt
apple;banana;mango

# file2.txt
banana;cherry;apple

To compare these using comm, first convert the semicolons to newlines, then sort each result:

tr ';' '\n' < file1.txt > file1_new.txt
tr ';' '\n' < file2.txt > file2_new.txt
sort file1_new.txt > file1_sorted.txt
sort file2_new.txt > file2_sorted.txt
comm file1_sorted.txt file2_sorted.txt
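
The steps above can be checked end to end with the two sample files. The sketch below recreates them in a temporary directory, captures the three-column output, and also joins the common items back into a semicolon-separated list with paste, which is one way to "convert back". The temporary directory and variable names are additions for this example.

```shell
tmpdir=$(mktemp -d)
printf 'apple;banana;mango\n'  > "$tmpdir/file1.txt"
printf 'banana;cherry;apple\n' > "$tmpdir/file2.txt"

# One item per line, then sorted, as comm requires.
tr ';' '\n' < "$tmpdir/file1.txt" | sort > "$tmpdir/file1_sorted.txt"
tr ';' '\n' < "$tmpdir/file2.txt" | sort > "$tmpdir/file2_sorted.txt"

# Full three-column output (columns are separated by tabs).
full=$(comm "$tmpdir/file1_sorted.txt" "$tmpdir/file2_sorted.txt")

# Common items only, joined back into the original delimiter.
common=$(comm -12 "$tmpdir/file1_sorted.txt" "$tmpdir/file2_sorted.txt" | paste -sd ';' -)

printf '%s\n' "$full"
printf 'common: %s\n' "$common"   # prints "common: apple;banana"

rm -rf "$tmpdir"
```

In the full output, apple and banana appear in the third column, cherry in the second, and mango in the first.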

Executable Script Example

Let's put this into a script:

#!/bin/bash

# Split two delimiter-separated files into one item per line, sort them,
# and compare the results with comm.
compare_files_with_custom_delimiter() {
  local file1=$1
  local file2=$2
  local delimiter=$3   # must be a single character: tr works on characters, not strings

  # Process substitution feeds the sorted streams straight to comm,
  # so no intermediate files are left behind or overwritten.
  comm <(tr "$delimiter" '\n' < "$file1" | sort) \
       <(tr "$delimiter" '\n' < "$file2" | sort)
}

# Usage
compare_files_with_custom_delimiter "file1.txt" "file2.txt" ";"

In this example, the function compare_files_with_custom_delimiter takes two filenames and a delimiter as arguments. It splits each file into one item per line, sorts the result, and passes both sorted lists to comm.
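
Because tr translates individual characters, it cannot split on a multi-character delimiter such as "::". One possible workaround, sketched here with awk's split function, is shown below; the helper name split_items and the sample data are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf 'apple::banana::mango\n'  > "$tmpdir/file1.txt"
printf 'banana::cherry::apple\n' > "$tmpdir/file2.txt"

# Hypothetical helper: print each delimiter-separated item on its own line.
# awk treats a multi-character delimiter as a regular expression, so
# characters that are special in regexes would need escaping.
split_items() {
  awk -v delim="$2" '{
    n = split($0, parts, delim)
    for (i = 1; i <= n; i++) print parts[i]
  }' "$1"
}

split_items "$tmpdir/file1.txt" '::' | sort > "$tmpdir/file1_sorted.txt"
split_items "$tmpdir/file2.txt" '::' | sort > "$tmpdir/file2_sorted.txt"

common=$(comm -12 "$tmpdir/file1_sorted.txt" "$tmpdir/file2_sorted.txt")
printf 'common items:\n%s\n' "$common"

rm -rf "$tmpdir"
```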

Conclusion

The comm command, while simple, becomes significantly more powerful when combined with other text processing tools like tr or awk. Understanding how to manipulate file contents and delimiters expands the usability of comm in various scenarios, especially in environments where structured data files are commonplace. As always in Linux, combining simple tools effectively leads to powerful solutions.

Further Reading

For further reading and to enhance your understanding of Linux commands similar to comm, consider the following resources: