Tuesday, July 5, 2011

Word Frequency Analysis

A simple Bash script that determines how often each word occurs in a text file called "mydatafile".

#!/bin/sh
# Print the 700 most frequent words in "mydatafile", least frequent first.
# The -s flag squeezes runs of spaces/newlines so blank lines are not counted.
cat mydatafile \
        | tr '[:punct:]'  ' ' \
        | tr '[:digit:]'  ' ' \
        | tr -s ' ' '\012' \
        | sort \
        | uniq -c \
        | sort -n \
        | tail -n 700

A line-by-line walkthrough of the script:

First, cat writes the text file to standard output and pipes it into a chain of tr commands.
The first tr replaces all punctuation with spaces, the second replaces all digits with spaces, and the third translates spaces into newlines, so that every word (a group of letters separated by spaces) lands on its own line.
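The effect of the three tr stages can be seen by running a short sample phrase (made up here for illustration) through the same pipeline:

```shell
# Run a sample phrase through the same tr stages as the script:
# punctuation -> spaces, digits -> spaces, spaces -> newlines (squeezed).
printf 'Hello, world! 42 hello world.\n' \
        | tr '[:punct:]'  ' ' \
        | tr '[:digit:]'  ' ' \
        | tr -s ' ' '\012'
# Prints Hello, world, hello, world, one word per line.
```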

The output is then run through a sort command, arranging the words into a sorted list so that duplicates sit on adjacent lines. That is piped into uniq -c, which collapses each run of identical words into a single line with two columns, the count and the word. Finally, sort -n orders those lines by count, in ascending order.
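The counting stage can be sketched on its own with a hand-made word list (the words here are invented for the example; the exact column padding of uniq -c varies by implementation):

```shell
# Count duplicate words: sort groups duplicates together, uniq -c
# collapses each group into "count word", sort -n orders by count.
printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' \
        | sort \
        | uniq -c \
        | sort -n
# Prints cherry with count 1, banana with count 2, apple with count 3.
```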

The last line trims the list down to the 700 most frequent words: because the counts are sorted in ascending order, tail keeps the lines with the largest counts.
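A small sketch of that final cutoff, using a pre-counted list and a cutoff of 3 instead of 700 (the words and counts are made up for illustration):

```shell
# On an ascending numeric sort, tail keeps the largest counts,
# i.e. the most frequent words.
printf '1 cherry\n2 banana\n3 apple\n5 the\n4 a\n' \
        | sort -n \
        | tail -n 3
# Prints the three lines with the highest counts: apple, a, the.
```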