The goal of this assignment is to develop a program to read and process an ascii file to find the top 50 most frequently used words in the document. We will be working with the full text of two famous books: David Copperfield by Charles Dickens and Robinson Crusoe by Daniel Defoe, which have been downloaded from www.gutenberg.org for educational purposes.
There are many ways to accomplish this goal, but we are going to use a hash table to count the number of times each unique word occurs as we scan the ascii file. You will start with an empty hash table. Then you will read a word from the ascii file, and insert this word into the hash table. If the word is already found in the hash table, you increment the corresponding word counter instead of adding a duplicate entry. If the word is not found in the hash table, you insert a new entry into the hash table with a word counter set to one. When you are finished processing the document, the hash table will contain N unique words and their associated word counters.
To find the top 50 most frequently used words in the the document, your program must print the contents of the hash table out to an ascii file called "words.txt" with each "count word" pair on a separate line. Then you can use the Linux "sort" and "head" commands to sort the hash words from the hash table in descending order, and select the top 50 words from the sorted list. The Linux commands to use are:Note: The "<" and ">" characters in Linux are used for file redirection. In the case above, the cin commands in hw5.exe will read "copperfield.txt" and the cout commands will write to the file "words.txt". Similarly, the Linux sort command will read from "words.txt" and write its sorted output to "words.sort". The commands above should work fine on Mac OS, but will not work on a Windows PC unless you have cygwin installed.
There are several key design issues that must be addressed in this project: (1) how to read the text file and extract the individual words, (2) how to store and access "count word" pairs in the hash table, and (3) how to extract and sort "count word" pairs to identify the top 50 most frequently used words in the document.
For this assignment, students are required to use separate chaining to implement the hash table. This approach uses linked lists to deal with hashing collisions, so students are welcome to use or modify "list.cpp" or any of the other linked list implementations in the src directory. Students are also encouraged to look at the "word_hash.cpp" program for examples of file I/O. The sample hash table in "hash.cpp" illustrates the typical API for a hash table, and shows how linear probing can be used to handle collisions.
You can implement this program using either a bottom-up approach or a top-down approach. If you go for a bottom-up approach, start by creating basic methods and classes, and test theses methods using a simple main program that calls each method. When this is working, you can create the main program that uses these methods to solve the problem above.
If you go for a top-down approach, start by creating your main program that reads user input, and calls empty methods to pretend to solve the problem. Then add in the code for these methods one at a time. This way, you will get an idea of how the whole program will work before you dive into the details of implementing each method and class.
Regardless of which technique you choose to use, you should always develop your code incrementally adding code, compiling, debugging, a little bit at a time. This way, you always have a program that "does something" even if it is not complete. Also, remember to keep backup versions of your code, so you always have a version to fall back on if you accidentally delete your code or your changes do not work out.
Test your program to check that it operates correctly for all of the requirements listed above. Also check for the error handling capabilities of the code. You are required to include your testing results in your project report to demonstrate that your program works correctly. To do this use the "script" command to save all program input/output in a "typescript" file, and cut and paste from this ascii file into your program documentation.
When you have completed your C++ program, use the "Programming Report Template" on the class website to document your programming project. This report has separate sections to describe the problem statement, your design decisions, your implementation process, and your testing results. Each of these sections should be 1-2 paragraphs long, so your completed report will be 2-3 pages long once you have included your testing output.
In this class, we will be using electronic project submission to make sure that all students hand their programming projects and labs on time, and to perform automatic analysis of all programs that are submitted. When you have completed the tasks above go to the class web site to "Upload" your documentation (a single docx file), and your C++ program (a single ascii text file).
The dates on your electronic submission will be used to verify that you met the due date above. All late projects will receive reduced credit:
You will receive partial credit for all programs that compile even if they do not meet all program requirements, so handing projects in on time is highly recommended.