
Use `tr` to delete non-printable Unicode characters

Author
  • User
    Linux Bash

Blog Article: Using tr to Delete Non-printable Unicode Characters in Linux Bash

When working with text files in a Linux environment, you might encounter issues with non-printable characters, which can disrupt file processing or display. In this post, we’ll explore how to use the tr command to handle these pesky characters efficiently.

Q1: What is the tr command in Linux Bash?

A1: tr stands for "translate" or "transliterate". It is a command-line utility in Unix-like operating systems, including Linux, for translating characters, deleting them, or squeezing runs of repeated characters into one. It takes no file arguments: it reads from standard input and writes to standard output, so files are supplied through redirection or a pipe.
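
As a minimal sketch of those three operations, the snippet below runs each one on a short literal string. The variable names are only illustrative.

```shell
#!/bin/bash
# Translate: map each character of the first set to the matching one in the second.
translated=$(printf 'hello\n' | tr 'el' 'ip')

# Delete (-d): drop every character that appears in the set.
deleted=$(printf 'hello\n' | tr -d 'l')

# Squeeze (-s): collapse each run of a repeated character into a single one.
squeezed=$(printf 'a    b\n' | tr -s ' ')

printf '%s\n' "$translated" "$deleted" "$squeezed"
```

In every case the input arrives on standard input through a pipe, because tr does not open files itself.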

Q2: How can tr be used to delete non-printable Unicode characters?

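A2: Combine the -c (complement) and -d (delete) options with a set of characters to keep: tr then deletes everything that is not in the set. The set can be a POSIX character class such as [:print:], which matches printable characters, or an explicit range of character codes. One caveat: GNU tr, the version found on most Linux systems, processes input byte by byte rather than as Unicode code points, so multibyte UTF-8 characters such as é are treated as non-ASCII bytes and are removed along with control characters. Note also that [:print:] excludes newline and tab, so add them to the set if they should survive.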
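
The sketch below illustrates the -c and -d combination with the [:print:] class. It assumes the C locale (set through LC_ALL=C) so that the class means printable ASCII only, and the sample strings are made up for the demonstration.

```shell
#!/bin/bash
# LC_ALL=C pins [:print:] to printable ASCII (space through tilde).
# \001 is a control byte (SOH) embedded in the sample input.
kept_newline=$(printf 'ab\001c\n' | LC_ALL=C tr -cd '[:print:]\n')

# Without \n in the set, the newline between the two lines is deleted too.
joined=$(printf 'a\nb\n' | LC_ALL=C tr -cd '[:print:]')

printf '%s\n' "$kept_newline" "$joined"
```

The second command shows why the newline usually has to be listed explicitly: with [:print:] alone, the whole input collapses onto one line.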

Q3: Can you give a practical example of using tr to delete non-printable characters?

A3: Certainly! Suppose you have a text file named "example.txt" that contains a mix of printable and non-printable characters. To remove all non-printable characters from the file, you can use the following command:

tr -cd '\11\12\15\40-\176' < example.txt > cleaned_example.txt

With -cd, tr deletes every byte that is not in the listed set. The set is written as octal character codes:

  • \11 is the octal code for horizontal tab.

  • \12 is the octal code for new line.

  • \15 is the octal code for carriage return.

  • \40-\176 covers the printable ASCII characters, from space (octal 40) to tilde (octal 176).
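
To see what this set keeps and what it drops, here is a small sketch on a made-up input containing a tab, a bell character (octal 007), and the two UTF-8 bytes of é (octal 303 251). LC_ALL=C is an added assumption so that tr works on raw bytes regardless of the system locale.

```shell
#!/bin/bash
# Sample input: "caf" + é (two UTF-8 bytes) + tab + "ok" + bell + newline.
cleaned=$(printf 'caf\303\251\tok\007\n' | LC_ALL=C tr -cd '\11\12\15\40-\176')

# The tab survives; the bell and both bytes of é are deleted.
printf '%s\n' "$cleaned"
```

The tab is kept because \11 is in the set, while the accented letter disappears entirely. If non-ASCII text must be preserved, this octal set is too strict for the job.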

Background on the Topic

The tr command operates by either deleting specified characters or replacing one set of characters with another. Here are a couple more examples to show its versatility:

  • Convert lowercase to uppercase:

    echo "hello world" | tr 'a-z' 'A-Z'
    

    This command translates all lowercase letters to uppercase.

  • Delete digits:

    echo "123 Easy Street" | tr -d '0-9'
    

    This removes all digits from the input string, outputting " Easy Street".
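
The same two operations can also be written with POSIX character classes instead of literal ranges. This is a sketch of an alternative spelling, not a change to the commands above; the classes follow the current locale, whereas a-z and 0-9 are fixed ASCII ranges.

```shell
#!/bin/bash
# [:lower:] and [:upper:] pair up lowercase letters with their uppercase forms.
upper=$(echo "hello world" | tr '[:lower:]' '[:upper:]')

# [:digit:] matches the decimal digits 0 through 9.
no_digits=$(echo "123 Easy Street" | tr -d '[:digit:]')

printf '%s\n' "$upper" "$no_digits"
```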

Executable Script Demonstrating tr

Now, let’s create an executable script to demonstrate how tr can clean a text file by removing non-printable characters:

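#!/bin/bash

# Ensure exactly one readable file is provided
if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
    echo "Usage: $0 <filename>" >&2
    exit 1
fi

input_file=$1
# Prefix only the file name so that paths like dir/file.txt still work
output_file="$(dirname -- "$input_file")/cleaned_$(basename -- "$input_file")"

# Keep tab, newline, carriage return, and printable ASCII; delete every other byte
tr -cd '\11\12\15\40-\176' < "$input_file" > "$output_file"

echo "Processed file saved as $output_file"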

Save this script as clean_text.sh, make it executable with chmod +x clean_text.sh, and run it by passing a filename as an argument.
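
One way to check the result is to count the bytes that fall outside the allowed set before and after cleaning. The sketch below does this on a temporary sample file rather than through clean_text.sh itself; count_unwanted is a hypothetical helper introduced only for this illustration, and the sample content is made up.

```shell
#!/bin/bash
# Hypothetical helper: count bytes that are NOT tab, newline, CR, or printable ASCII.
# It deletes the allowed bytes and counts whatever is left.
count_unwanted() {
    LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' '
}

sample=$(mktemp)
cleaned=$(mktemp)

# Sample with a NUL byte, an ANSI escape sequence (ESC [ 0 m), and a DEL byte.
printf 'line one\000\033[0m\nline two\177\n' > "$sample"

# Same cleaning step as in the script.
LC_ALL=C tr -cd '\11\12\15\40-\176' < "$sample" > "$cleaned"

before=$(count_unwanted "$sample")
after=$(count_unwanted "$cleaned")
echo "unwanted bytes before: $before, after: $after"

rm -f "$sample" "$cleaned"
```

Note that only the ESC byte of the escape sequence is removed. The printable remainder, [0m, stays in the cleaned text, so terminal color codes need a pattern-based tool rather than tr alone.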

Conclusion

The tr command is a powerful tool in the Linux toolkit, particularly useful for manipulating text data, whether that means translating character sets or purging unwanted characters. By mastering tr, you can handle text processing tasks efficiently in your scripts or on the command line, keeping your data clean and standardized with minimal effort.
