Word Frequency Analyzers (WFAs)
- Accept one parameter - input file
- Open file - raise exception if it does not exist and exit the program
- Read input file, line by line
- For each line, tokenize (separate into words)
- Eliminate any punctuation characters (period, comma, semicolon, colon, question mark, exclamation point, and other non-alphanumeric characters)
- At the end of the input file, produce the following output.
- Write an output file in the following format (sorted by descending order of frequency):
word, frequencycount
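The steps above can be sketched in Python roughly as follows. This is only one possible reading of the spec: the function names are hypothetical, lowercasing words before counting is an assumption (the spec does not say whether case matters), and the pattern keeps only ASCII letters and digits.

```python
import re
from collections import Counter


def tokenize(line):
    """Split a line on whitespace and drop every non-alphanumeric character."""
    words = []
    for raw in line.split():
        # Lowercasing is an assumption; the spec does not mention case.
        word = re.sub(r"[^A-Za-z0-9]", "", raw).lower()
        if word:  # skip tokens that were pure punctuation
            words.append(word)
    return words


def count_words(input_path):
    """Read the file line by line and count how often each word occurs."""
    counts = Counter()
    # open() raises FileNotFoundError if the input file does not exist.
    with open(input_path, encoding="utf-8") as infile:
        for line in infile:
            counts.update(tokenize(line))
    return counts


def write_frequencies(counts, output_path):
    """Write 'word, frequencycount' lines, most frequent word first."""
    # Ties are broken alphabetically (an assumption) to keep the output stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    with open(output_path, "w", encoding="utf-8") as outfile:
        for word, count in ordered:
            outfile.write(f"{word}, {count}\n")
```

A command-line wrapper would pass `sys.argv[1]` to `count_words`, catch `FileNotFoundError`, and exit with a message when the input file is missing.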
WFA-2:
Goal: Given an input file and an optional noise word file, print a frequency count of non-noise words in the input file.
1. Change the name of the program to WordAnalyzer
2. Accept two parameters - input file and noise file
3. Input file is the same as the first version
4. Noise file contains a set of noise words separated by white space (one or more spaces, newlines, tabs)
5. Open both files - raise exception if either of them does not exist and exit the program
6. Read noise file and store all the words in memory
7. Read input file, line by line
8. For each line, tokenize (separate into words)
9. Eliminate any punctuation characters (period, comma, semicolon, colon, question mark, exclamation point, and other non-alphanumeric characters)
10. Increment the input-word count for every input word
11. Check the word against the noise words:
- if it is a noise word, increment the noise-word count (so that we know how many noise words are in the text)
- if it is not a noise word, increment the valid-word count and that word's frequency count
12. At the end of the input file, produce the following output.
13. Count of input words, count of noise words in the input file, count of valid words
14. Write an output file in the following format (sorted by descending order of frequency):
word, frequencycount
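A minimal sketch of the noise-word version, under a few assumptions: the function names are hypothetical, words are lowercased before comparison, and the three summary counts are written at the top of the output file (the spec does not say where they should go).

```python
import re
from collections import Counter


def load_noise_words(noise_path):
    """Return the noise words as a set; split() handles spaces, tabs and newlines."""
    with open(noise_path, encoding="utf-8") as noise_file:
        return {word.lower() for word in noise_file.read().split()}


def analyze(input_path, noise_path):
    """Return (input-word count, noise-word count, Counter of valid words)."""
    noise_words = load_noise_words(noise_path)  # kept in memory for fast lookup
    total_words = 0
    noise_count = 0
    frequencies = Counter()
    with open(input_path, encoding="utf-8") as infile:
        for line in infile:
            for raw in line.split():
                # Strip punctuation; lowercasing is an assumption, not part of the spec.
                word = re.sub(r"[^A-Za-z0-9]", "", raw).lower()
                if not word:
                    continue
                total_words += 1
                if word in noise_words:
                    noise_count += 1
                else:
                    frequencies[word] += 1
    return total_words, noise_count, frequencies


def write_report(total_words, noise_count, frequencies, output_path):
    """Write the three counts, then 'word, frequencycount' lines by descending frequency."""
    valid_count = total_words - noise_count
    with open(output_path, "w", encoding="utf-8") as outfile:
        # Putting the summary in the output file is an assumption.
        outfile.write(f"input words: {total_words}\n")
        outfile.write(f"noise words: {noise_count}\n")
        outfile.write(f"valid words: {valid_count}\n")
        for word, count in frequencies.most_common():
            outfile.write(f"{word}, {count}\n")
```

Storing the noise words in a set keeps each membership check cheap, and the valid-word count falls out as the input-word count minus the noise-word count, so it does not need its own counter.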
WFA-3: From Web Pages
Goal: Given a web page address (URL), perform a word frequency analysis on the content and print the results. Reuse the modules/packages developed in Word Frequency Analyzer-2.
- Accept the following parameters - url, noise file, output file
- Read the page at the url
- Parse the page, remove tags and write all the text into a temporary file
- Invoke Word Frequency Analyzer-2 with the temporary file, noise file and output file
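The fetch-and-strip part of these steps could look something like the sketch below, which uses only the Python standard library (`urllib.request`, `html.parser`, `tempfile`). The class and function names are hypothetical, and skipping the contents of `<script>` and `<style>` is an assumption about what counts as page text.

```python
import tempfile
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(page_html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(page_html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_page(url):
    """Download the page and decode it using the charset the server declares."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_to_temp_file(url):
    """Write the page text to a temporary file and return that file's path."""
    text = extract_text(fetch_page(url))
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(text)
        return tmp.name
```

The path returned by `page_to_temp_file` would then be handed to the Word Frequency Analyzer-2 code together with the noise file and output file. Because the temporary file is created with `delete=False`, the caller should remove it once the analysis is done.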
The skills you need for this project
- Parsing HTML and extracting page content
- Using dictionaries/hashes/maps to store noise words and count the frequency of occurrence of input words
- sorting
WFA-4:
Same as Word Frequency Analyzer-3, with the following differences:
1. Instead of outputting a frequency table, output a tag cloud (What is a TagCloud?)
2. Use an existing tag cloud library in your favorite language (a list of links is in the TagCloud page)
3. Typically, tag cloud generators create an HTML file. Try to display it in the browser manually first
4. Try to invoke a browser and pass it the tag cloud HTML file you generated (so when we run the program, a browser will be activated and the tag cloud displayed)
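To illustrate steps 3 and 4, here is a rough sketch that launches the default browser with Python's standard `webbrowser` module. It does not use a tag cloud library: `build_tag_cloud_html` is a hand-rolled, hypothetical stand-in that simply scales font size with word count, and the pixel range and alphabetical ordering are assumptions. A real library's output file could be opened the same way.

```python
import html
import pathlib
import webbrowser


def build_tag_cloud_html(frequencies, min_px=12, max_px=48):
    """Return an HTML page where each word's font size grows with its count."""
    if not frequencies:
        return "<html><body></body></html>"
    low = min(frequencies.values())
    high = max(frequencies.values())
    spread = (high - low) or 1  # avoid dividing by zero when all counts are equal
    spans = []
    for word in sorted(frequencies):  # alphabetical order is an assumption
        scale = (frequencies[word] - low) / spread
        size = round(min_px + scale * (max_px - min_px))
        spans.append(
            f'<span style="font-size:{size}px">{html.escape(word)}</span>'
        )
    return "<html><body>\n" + "\n".join(spans) + "\n</body></html>"


def show_tag_cloud(frequencies, html_path):
    """Write the cloud to html_path and ask the default browser to open it."""
    path = pathlib.Path(html_path).resolve()
    path.write_text(build_tag_cloud_html(frequencies), encoding="utf-8")
    # webbrowser.open returns True if a browser could be launched.
    return webbrowser.open(path.as_uri())
```

Calling `show_tag_cloud` with the valid-word frequencies and an output path writes the page and asks the system browser to display it, which covers the "invoke a browser" step without hard-coding a particular browser.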
WFA5:
Posted in Uncategorized on September 21st, 2008 by Dorai
