#!/bin/sh
cat mydatafile \
| tr '[:punct:]' ' ' \
| tr '[:digit:]' ' ' \
| tr ' ' '\012' \
| sort \
| uniq -c \
| sort -n \
| tail -700
Line-by-line walkthrough of the above script:
First, cat writes the text file to standard output, which is piped into a chain of three tr commands.
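The first tr replaces every punctuation character with a space. Its output is piped into a second tr, which replaces every digit with a space. That output is piped into a third tr, which turns each space into a newline ('\012' is the octal code for newline), so every word (a group of letters separated by a ' ') ends up on its own line. Note that a run of several spaces becomes a run of empty lines, so the empty string is counted as a "word" as well.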
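To make the three tr stages concrete, here is a minimal sketch run on a one-line sample instead of mydatafile. The sample text and the variable names (sample, spaced, words, squeezed) are made up for illustration, and the squeezed variant using tr -s is a possible alternative, not part of the original script. It also assumes a tr that pads the shorter second set with its last character, as the GNU and BSD versions do.

```shell
#!/bin/sh
# Hypothetical sample input; the original script reads mydatafile instead.
sample='Hello, world! 42 cats.'

# Stages 1 and 2: punctuation and digits are replaced by spaces, not deleted.
spaced=$(printf '%s\n' "$sample" \
| tr '[:punct:]' ' ' \
| tr '[:digit:]' ' ')

# Stage 3, as in the script: each space becomes a newline,
# so runs of spaces leave empty lines behind.
words=$(printf '%s\n' "$spaced" | tr ' ' '\012')

# Possible variant (an assumption, not in the original): -s squeezes the
# repeated newlines, so no empty lines reach sort and uniq.
squeezed=$(printf '%s\n' "$spaced" | tr -s ' ' '\012')

printf 'as in the script:\n%s\n' "$words"
printf 'with tr -s:\n%s\n' "$squeezed"
```

The first listing contains empty lines between the words; the second contains only the three words, one per line.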
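The output is then run through sort, which puts identical words on adjacent lines. This matters because uniq only collapses duplicates that are next to each other. uniq -c prints each distinct word once, prefixed by the number of times it occurred, giving two columns: the count and the word. That is piped into sort -n, which sorts the list numerically by count in ascending order, so the most frequent words end up at the bottom.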
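The counting step can be tried on its own with a short, made-up word list. This is only a sketch: the six input words and the variable name counts are hypothetical, and the exact width of the leading padding that uniq -c prints varies between implementations.

```shell
#!/bin/sh
# Hypothetical word list, one word per line, standing in for the tr output.
counts=$(printf '%s\n' b a b c b a \
| sort \
| uniq -c \
| sort -n)

# Each line is "<count> <word>", least frequent first.
printf '%s\n' "$counts"
```

Here c occurs once, a twice, and b three times, so b is on the last line. Without the first sort, uniq -c would report the scattered b lines as separate entries.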
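The last line, tail -700, keeps only the final 700 lines of that list. Because the list is in ascending order of count, these are the 700 most frequent words, with the most frequent word last.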
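The effect of the final tail can be sketched with a small list and a limit of 2 instead of 700. The input words and variable names below are made up for illustration. The second form, sort -rn piped into head, is an alternative that is not in the original script; it lists the most frequent word first rather than last.

```shell
#!/bin/sh
# Hypothetical counted list, built the same way as in the script.
sorted=$(printf '%s\n' b a b c b a | sort | uniq -c | sort -n)

# Same idea as the script's tail -700, keeping the last 2 lines.
top_tail=$(printf '%s\n' "$sorted" | tail -n 2)

# Alternative (an assumption, not the original): reverse the numeric
# sort and keep the first 2 lines, so the top word comes first.
top_head=$(printf '%s\n' "$sorted" | sort -rn | head -n 2)

printf 'tail -n 2:\n%s\n' "$top_tail"
printf 'sort -rn | head -n 2:\n%s\n' "$top_head"
```

Both forms select the same two words (a and b); only the order differs.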