The goal of this assignment is to develop a program to read and process an ascii file to find the top 50 most frequently used words in the document. We will be working with the full text of two famous books: David Copperfield by Charles Dickens and Robinson Crusoe by Daniel Defoe, which have been downloaded from www.gutenberg.org for educational purposes.
There are many ways to accomplish this goal, but we are going to use a binary search tree to count the number of times each unique word occurs as we scan the ascii file. You will start with an empty binary search tree. Then you will read a word from the ascii file, and insert this word into a binary search tree. If the word is already found in the tree, you increment the corresponding word counter instead of adding a duplicate node. If the word is not found in the binary search tree, you insert a new node into the tree with a word counter set to one. When you are finished processing the document, the binary search tree will contain N unique words and their associated word counters.
To find the top 50 most frequently used words in the the document, you need to print the contents of the binary search tree with each "count word" pair on a separate line in a text file. Then you can use the unix "sort -nr" command to sort this text file with the largest values at the top of the file.
There are several key design issues that must be addressed in this project: (1) how to read the text file and extract the individual words, (2) how to adapt the binary search tree to contain "count word" pairs, and (3) how to extract and sort "count word" pairs to identify the top 50 most frequently used words in the document.
Students are welcome to modify and adapt the current binary search tree implementation tree.h and tree.cpp for this assignment. This BST currently stores only integer values, so the node needs to be modified to contain a string "word" and an integer "count", and the insert, and search methods need to be modified accordingly. You can delete the tree sort operations if you wish.
You can implement this program using either a bottom-up approach or a top-down approach. If you go for a bottom-up approach, start by creating basic methods and classes, and test theses methods using a simple main program that calls each method. When this is working, you can create the main program that uses these methods to solve the problem above.
If you go for a top-down approach, start by creating your main program that reads user input, and calls empty methods to pretend to solve the problem. Then add in the code for these methods one at a time. This way, you will get an idea of how the whole program will work before you dive into the details of implementing each method and class.
Regardless of which technique you choose to use, you should develop your code incrementally adding code, compiling, debugging, a little bit at a time. This way, you always have a program that "does something" even if it is not complete.
Test your program to check that it operates correctly for all of the requirements listed above. Also check for the error handling capabilities of the code. Try your program on 2-3 input documents, and save your testing output in text files for submission on the program due date.
When you have completed your C++ program, write a short report (less than one page long) describing what the objectives were, what you did, and the status of the program. Does it work properly for all test cases? Are there any known problems? Save this report in a separate text file to be submitted electronically.
In this class, we will be using electronic project submission to make sure that all students hand their programming projects and labs on time, and to perform automatic analysis of all programs that are submitted. When you have completed the tasks above go to the class web site to "submit" your documentation, C++ program, and testing files.
The dates on your electronic submission will be used to verify that you met the due date above. All late projects will receive reduced credit (50% off if less than 24 hours late, no credit if more than 24 hours late), so hand in your best effort on the due date.
You should also PRINT a copy of these files and hand them into your teaching assistant within ONE DAY of the due date. This can be done during class, during lab, or using the department mailbox. Attach a copy of the Programming Project Evaluation Form to your report with your name and project number filled in.