Computer Biology: Word Frequency Analysis

A simple shell script that counts how often each word occurs in a text file called "mydatafile".

#!/bin/sh
# Count word frequencies in "mydatafile" and print the 700 most frequent words.
cat mydatafile \
        | tr '[:punct:]' ' ' \
        | tr '[:digit:]' ' ' \
        | tr ' ' '\012' \
        | sort \
        | uniq -c \
        | sort -n \
        | tail -n 700

Line-by-line walkthrough of the above script:

First, cat writes the text file to standard output, which is piped through three tr commands.
The first tr replaces every punctuation character with a space, the second does the same for every digit, and the third turns each space into a newline (octal 012), so that every word (a group of letters separated by a ' ') ends up on its own line.
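
To see what the three tr stages actually emit, here is a small sketch run on a made-up sample line (the sample text and the variable names are assumptions for illustration, not part of the original script). It also shows a side effect worth knowing about: every run of several spaces becomes several newlines, so the word list contains empty lines. The second pipeline is one possible variant, using tr -s to squeeze runs of newlines into one.

```sh
#!/bin/sh
# Hypothetical sample text, used only for illustration.
sample='Hello, world! 42 cats.'

# The same three tr stages as in the script above.
words=$(printf '%s\n' "$sample" \
        | tr '[:punct:]' ' ' \
        | tr '[:digit:]' ' ' \
        | tr ' ' '\012')

# Variant (an assumption, not part of the original script): -s translates
# spaces to newlines and then collapses each run of newlines into one.
squeezed=$(printf '%s\n' "$sample" \
        | tr '[:punct:]' ' ' \
        | tr '[:digit:]' ' ' \
        | tr -s ' ' '\012')

printf 'original stages:\n%s\n\n' "$words"
printf 'with tr -s:\n%s\n' "$squeezed"
```

With the original stages the three words are separated by empty lines, which the later uniq -c step would count as one extra blank "word". With tr -s the sample yields just Hello, world and cats, one per line.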

The output is then run through sort, which puts identical words next to each other. That matters because uniq -c only collapses adjacent duplicate lines: it replaces each run of identical lines with a single line and prefixes it with the number of occurrences, giving two columns, the count and the word. This is piped into sort -n, which sorts the list numerically by count, in ascending order.
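
The counting stages can be tried on their own with a short, made-up word list (the words and the counts variable are hypothetical). The trailing awk is not part of the original script; it is added here only because the amount of padding uniq -c puts in front of the count differs between implementations.

```sh
#!/bin/sh
# Hypothetical word list, one word per line, as the tr stages would emit it.
counts=$(printf '%s\n' gene protein gene cell gene protein \
        | sort \
        | uniq -c \
        | sort -n \
        | awk '{ print $1, $2 }')   # normalize uniq's padding; illustration only

printf '%s\n' "$counts"
# prints:
# 1 cell
# 2 protein
# 3 gene
```

The least frequent word comes first and the most frequent last, which is why the script reads the result from the end of the list.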

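The last line, tail, keeps only the final 700 lines of that list. Because the list is sorted in ascending order of count, these are the 700 most frequent words, with the most frequent word at the very bottom.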
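
A small sketch of that final cut, using made-up counts and a limit of 2 instead of 700 (the data and variable names are hypothetical). It uses tail -n, the POSIX spelling of the same option. The second command is an alternative, not what the original script does: reversing the numeric sort and taking the head of the list puts the most frequent word first.

```sh
#!/bin/sh
# Hypothetical counts, already in ascending order as sort -n leaves them.
ranked='1 cell
2 protein
3 gene'

# As in the script: keep the last N lines (N is 2 here instead of 700).
top_tail=$(printf '%s\n' "$ranked" | tail -n 2)

# Alternative (an assumption, not the original method): sort descending
# and keep the first N lines, so the most frequent word is listed first.
top_head=$(printf '%s\n' "$ranked" | sort -rn | head -n 2)

printf 'tail:\n%s\n\n' "$top_tail"
printf 'sort -rn | head:\n%s\n' "$top_head"
```

Both select the same two entries (protein and gene); only the order in which they are printed differs.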

Tuesday, July 5, 2011