21 April 2007

Simple Command Line Text Processing

Impromptu

I almost didn't write this, because I checked the latest goings-on at GnuJersey.org and happened to see that Dave had posted something about sed. I didn't want to be a copycat. Then I remembered that nothing is original anymore. Besides, you will find no mention of sed (other than these first two mentions) or anything else from Dave's post here. So here we go.


The other day I needed to sort a few dozen lines of text. For some reason, the behemoth text editor I was using didn't have a sort function. (What gives?) A commercial competitor I had recently switched from did have this functionality, but I didn't want to reinstall it just for one quick sort. Instead, I turned to the old standby of the UNIX user: piping for text processing!

In order to demonstrate some key functionality, I am going to operate on the contents of a simple text file. The file contains the ten most common passwords and their respective frequencies, according to this article, with a few duplicate lines thrown in for demonstration purposes. For reference, here are the contents of the file:

letmein 1.76%
thomas 0.99%
arsenal 1.11%
monkey 1.33%
charlie 1.39%
qwerty 1.41%
qwerty 1.41%
123456 1.63%
letmein 1.76%
liverpool 1.82%
password 3.780%
arsenal 1.11%
123 3.784%


In order to understand what is going on, you need to be familiar with three redirection operators used in the UNIX world. They are as follows.


  • < reads data from a file into the command on the left

  • > writes data from a command into a file on the right

  • | pipes the output of one command into the input of the next



Shell redirection is actually far more complicated and capable than that, but that is another post.
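To make the three operators concrete, here is a minimal sketch (the file name sample.txt is just an example):

```shell
# > writes a command's output into a file on the right
printf 'b\na\nc\n' > sample.txt

# < reads data from a file into the command on the left
sort < sample.txt

# | pipes the output of one command into the input of the next
printf 'b\na\nc\n' | sort
```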

The first key command is sort. sort does what you might expect: it sorts the lines of a text file.

$ sort < demo.txt
123 3.784%
123456 1.63%
arsenal 1.11%
arsenal 1.11%
charlie 1.39%
letmein 1.76%
letmein 1.76%
liverpool 1.82%
monkey 1.33%
password 3.780%
qwerty 1.41%
qwerty 1.41%
thomas 0.99%
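By default sort compares whole lines as text, which is why the output above is ordered by password rather than by frequency. If you instead wanted to rank the passwords by how common they are, sort's -n (compare numerically) and -k (pick the key field) options do the trick. A small self-contained sketch using a few lines from the demo file:

```shell
# Recreate a few lines of demo.txt so the example stands alone
printf 'letmein 1.76%%\npassword 3.780%%\nthomas 0.99%%\n' > demo.txt

# -k 2 starts the sort key at field 2; -n compares it as a number
sort -n -k 2 < demo.txt
```

This prints thomas first (0.99%) and password last (3.780%).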


In order to clean things up, sort is often combined with uniq. uniq removes duplicate lines from a text file, but it has a serious limitation: it only removes duplicates that are adjacent. Therefore, if you want to use uniq, use sort first.

$ sort < demo.txt | uniq
123 3.784%
123456 1.63%
arsenal 1.11%
charlie 1.39%
letmein 1.76%
liverpool 1.82%
monkey 1.33%
password 3.780%
qwerty 1.41%
thomas 0.99%
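Two related shortcuts worth knowing: uniq's -c flag prefixes each line with the number of times it occurred, and sort's -u flag collapses the whole sort | uniq pipeline into a single command. A quick sketch:

```shell
# Recreate a small file with a duplicate line
printf 'qwerty\nqwerty\n123456\n' > dupes.txt

# uniq -c prefixes each surviving line with its repeat count
sort < dupes.txt | uniq -c

# sort -u sorts and removes duplicates in one step
sort -u < dupes.txt
```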


Another handy one is cut. cut splits lines into fields (think spreadsheet columns) and lets you keep just the fields you want. cut can split on any single-character delimiter, given by the -d option. A comma-separated list of fields to keep is given by the -f option.

$ sort < demo.txt | uniq | cut -d ' ' -f 1
123
123456
arsenal
charlie
letmein
liverpool
monkey
password
qwerty
thomas
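For a delimiter other than a space, just change -d. For example, colon-separated records in the style of /etc/passwd (the sample line below is made up):

```shell
# Keep fields 1 and 7 (user name and shell) of a colon-delimited record.
# Note that cut rejoins the selected fields with the same delimiter.
printf 'root:x:0:0:root:/root:/bin/sh\n' | cut -d ':' -f 1,7
```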


If you want to get back to your command line roots, or just like the speed and simplicity of throwing a few small, highly specialized commands at your problems, keep exploring. There are several more fun text processing commands available on most UNIX-like operating systems. More complicated tools like awk and sed open up even more quick solutions to common text processing problems.
