The goal of this assignment is to develop a program that reads and processes an ASCII file to find the top 50 most frequently used words in the document. We will be working with the full text of two famous books: David Copperfield by Charles Dickens and Robinson Crusoe by Daniel Defoe, which have been downloaded from www.gutenberg.org for educational purposes.
There are many ways to accomplish this goal, but we are going to use a hash table to count the number of times each unique word occurs as we scan the ASCII file. You will start with an empty hash table. Then, for each word you read from the ASCII file, you will look that word up in the hash table. If the word is already in the hash table, you increment the corresponding word counter instead of adding a duplicate entry. If the word is not in the hash table, you insert a new entry with its word counter set to one. When you are finished processing the document, the hash table will contain N unique words and their associated word counters.
To find the top 50 most frequently used words in the document, you need to print the contents of the hash table out to an ASCII file called "words.txt" with each "count word" pair on a separate line. If you execute the Unix command "sort -nr words.txt > words.sort" you will create an output file with the most frequent words at the top of the file. This can be done at the Unix command prompt, or by calling the "system()" function inside your program.
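The output step described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes the "count word" pairs have already been copied out of the hash table into a vector, and the helper names write_counts and sort_counts are hypothetical, not part of the provided src code. The sort command itself is the one given in the paragraph above.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: write one "count word" pair per line.
// Each entry is a (word, count) pair; the count is printed first so that
// "sort -nr" can order the lines numerically.
// Returns false if the file could not be opened or written.
bool write_counts(const std::vector<std::pair<std::string, int>>& entries,
                  const std::string& filename)
{
    std::ofstream out(filename);
    if (!out)
        return false;
    for (const auto& entry : entries)
        out << entry.second << ' ' << entry.first << '\n';
    return static_cast<bool>(out);
}

// Hypothetical helper: run the sort command from the assignment text.
// std::system returns an implementation-defined status; on Unix systems
// a value of 0 normally means the command succeeded.
bool sort_counts()
{
    return std::system("sort -nr words.txt > words.sort") == 0;
}
```

After calling write_counts(entries, "words.txt") and then sort_counts(), the first 50 lines of "words.sort" are the 50 most frequent words.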
There are several key design issues that must be addressed in this project: (1) how to read the text file and extract the individual words, (2) how to store and access "count word" pairs in the hash table, and (3) how to extract and sort "count word" pairs to identify the top 50 most frequently used words in the document.
For this assignment, students are required to use separate chaining to implement the hash table. This approach uses linked lists to deal with hashing collisions, so students are welcome to modify and adapt any of the linked list implementations in the src directory as needed. Students are also encouraged to look at the "word_count2.cpp" program for examples of file I/O. The sample hash table in "hash.h" and "hash.cpp" illustrates the typical API for a hash table, and shows how linear probing can be used to handle collisions. Note that linear probing is a different collision strategy from the separate chaining required here.
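To make the idea of separate chaining concrete, here is a minimal sketch of a word-counting table in which each bucket is the head of a singly linked list. Everything in it is an assumption made for illustration: the class name WordTable, the method names add_word, get_count, and size, and the simple multiply-by-31 hash function. It is not the API of the provided "hash.h", and it omits the traversal you would need to print every entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch of a separate-chaining word counter.
// All names here (WordTable, Node, add_word, get_count) are hypothetical.
class WordTable
{
public:
    explicit WordTable(std::size_t num_buckets)
        : buckets(num_buckets, nullptr), unique_words(0)
    {
        assert(num_buckets > 0);
    }

    // Free every node in every chain.
    ~WordTable()
    {
        for (Node* head : buckets) {
            while (head != nullptr) {
                Node* next = head->next;
                delete head;
                head = next;
            }
        }
    }

    // Copying is disabled to keep the sketch short and avoid double deletes.
    WordTable(const WordTable&) = delete;
    WordTable& operator=(const WordTable&) = delete;

    // Increment the counter if the word is present,
    // otherwise insert it at the front of its chain with a count of 1.
    void add_word(const std::string& word)
    {
        std::size_t index = hash(word);
        for (Node* node = buckets[index]; node != nullptr; node = node->next) {
            if (node->word == word) {
                ++node->count;
                return;
            }
        }
        buckets[index] = new Node{word, 1, buckets[index]};
        ++unique_words;
    }

    // Return the counter for a word, or 0 if the word is not in the table.
    int get_count(const std::string& word) const
    {
        for (Node* node = buckets[hash(word)]; node != nullptr; node = node->next) {
            if (node->word == word)
                return node->count;
        }
        return 0;
    }

    // Number of unique words stored so far.
    std::size_t size() const { return unique_words; }

private:
    struct Node
    {
        std::string word;
        int count;
        Node* next;
    };

    // Simple polynomial string hash; other hash functions would also work.
    std::size_t hash(const std::string& word) const
    {
        std::size_t value = 0;
        for (unsigned char ch : word)
            value = value * 31 + ch;
        return value % buckets.size();
    }

    std::vector<Node*> buckets;
    std::size_t unique_words;
};
```

With this sketch, calling add_word("the") twice and add_word("cat") once leaves get_count("the") equal to 2, get_count("cat") equal to 1, and size() equal to 2. Two words that hash to the same bucket simply share a chain, which is how separate chaining resolves collisions without probing.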
You can implement this program using either a bottom-up approach or a top-down approach. If you go for a bottom-up approach, start by creating basic methods and classes, and test these methods using a simple main program that calls each method. When this is working, you can create the main program that uses these methods to solve the problem above.
If you go for a top-down approach, start by creating your main program that reads user input, and calls empty methods to pretend to solve the problem. Then add in the code for these methods one at a time. This way, you will get an idea of how the whole program will work before you dive into the details of implementing each method and class.
Regardless of which technique you choose to use, you should develop your code incrementally: add code, compile, and debug a little bit at a time. This way, you always have a program that "does something" even if it is not complete.
When you think you are about halfway through the program, upload a copy of your source code and your program output at that point. Be sure to hand in something that compiles even if it does not do much when it runs.
Test your program to check that it operates correctly for all of the requirements listed above. Also check the error handling capabilities of the code. Try your program on 2-3 input documents, and save your testing output in text files for submission on the program due date.
When you have completed your C++ program, write a short report (less than one page long) describing what the objectives were, what you did, and the status of the program. Does it work properly for all test cases? Are there any known problems? Save this report in a separate text file to be submitted electronically.
In this class, we will be using electronic project submission to make sure that all students hand in their programming projects and labs on time, and to perform automatic analysis of all programs that are submitted. When you have completed the tasks above, go to the class web site to "submit" your documentation, C++ program, and testing files.
The dates on your electronic submission will be used to verify that you met the due date above. All late projects will receive reduced credit (50% off if less than 24 hours late, no credit if more than 24 hours late), so hand in your best effort on the due date.
You should also PRINT a copy of these files and hand them in to your teaching assistant in your next lab. Include a title page which has your name and uaid, and attach your handwritten design notes from above.