The goal of this assignment is to develop a program to read and process an ascii file to find the top 500 most frequently used words in the document. We will be working with the full text of two famous books: David Copperfield by Charles Dickens and Robinson Crusoe by Daniel Defoe, which have been downloaded from www.gutenberg.org for educational purposes.
There are many ways to accomplish this goal, but we are going to use an accumulator approach to count the number of times each unique word occurs as we scan the ascii file. You will start with an empty hash table. Then you will read a word from the ascii file, and look this word up in the hash table. If the word is found in the hash table, you increment the associated word counter. If not, you add the word to the hash table with a word counter initialized to one. When you are finished processing the document, the hash table will contain N unique words and their associated word counters.
The next step in the process is to identify the top 500 most frequently used words by finding the words with the largest word counters. First, you need to extract all of the words and their associated counts from the hash table. Then you need to select the top 500 from this list. You can do this by sorting the (word, count) pairs in descending order based on count. This can be easily accomplished using the Linux "sort -n -r" command. You can select the most frequent words using the Linux "head -500" command.
For this assignment, students may work in groups of two to complete this programming project. If students choose this option, they must complete the tasks above, plus they must do the following:
Much of the design for this assignment has been specified above. The major decisions that remain are: (1) how to process the ascii input file to identify words, (2) how to adapt the existing hash table code to support (word, count) pairs, and (3) how to extract all of the (word, count) pairs from the hash table.
For this assignment, students are welcome to use and adapt any of the source code in the class directory. There are three sample implementations of the hash table class. Two that support integer (key, value) pairs, and one that supports string (key, value) pairs. You will need a hybrid, where key=string, value=integer.
The hash in "hash2.cpp" is OK for very small collections of strings, but it will result in a lot of collisions if 100's of strings are stored in the hash table. Hence, you will probably need to modify this hash function to "spread out" the values more. You can use any sequence of arithmetic operations you want for this purpose.
In order to make use of the Linux "sort -n -r" command, you will need to output the (word, count) data into an ascii file with one pair per line. Then you can call sort and "head -500" from the command line (or using system commands within your program).
Test your program by calculating the top 500 words for both books above. Hopefully the lists will be very similar, but we can expect some differences. What is the first word that is different? Save the top 50 words for both books for your printed report, but send in the top 500 in your online submission.
Test your program to check that it operates correctly for all of the requirements listed above. Also check for the error handling capabilities of the code. Save your testing output in text files for submission on the program due date.
When you have completed your C++ program, write a short report (less than one page long) describing what the objectives were, what you did, and the status of the program. Does it work properly for all test cases? Are there any known problems? Save this report in a separate text file to be submitted electronically.
In this class, we will be using electronic project submission to make sure that all students hand their programming projects and labs on time, and to perform automatic analysis of all programs that are submitted. When you have completed the tasks above go to the class web site to "submit" your documentation, C++ program, and testing files.
The dates on your electronic submission will be used to verify that you met the due date above. All late projects will receive reduced credit (50% off if less than 24 hours late, no credit if more than 24 hours late), so hand in your best effort on the due date.
You should also PRINT a copy of these files and hand them into your teaching assistant in your next lab. Include a title page which has your name and uaid, and attach your hand written design notes from above.