The goal of this programming assignment is to gain experience with recursive binary search, reading and processing ascii files, and object oriented design. Your task is to write a program that reads a document written in English, and calculates the estimated reading level of the document based on the rarity of words used. Documents that use very common words will have a lower reading level than documents that use very rare words.
You will be given a file with the top 1000 most commonly used English words based on frequency analysis of a collection of novels. Each line in the file consists of an integer rank (1 = most common, 1000 = least common) followed by the word. To simplify your analysis all of the upper case letters have been converted to lower case (e.g. Mr converted to mr). Here is a link to the file with words sorted according to rank: rank.txt. Here is a link to the file with the words sorted alphabetically: words.txt.
To estimate the reading level of a document, you must read the words one at a time, look each word up in your dictionary to see what the rank of the word is, and then calculate the average word rank for the document. For example, consider the first paragraph of "Cat in the Hat" by Dr. Seuss.
The sun did not shine. It was too wet to play. So we sat in the house All that cold, cold, wet day.
The corresponding ranks for these words are:
1 672 77 20 1001 11 8 106 1001 3 617 33 54 273 7 1 133 35 10 463 463 1001 123
Notice that we have two words "shine" and "wet" that did not occur in our list of most common 1000 words, so we gave them an artificial rank of 1001. The average rank of these 23 words is 265.78.
In addition to the word rank files above, you will be given five samples of text taken from the beginnings of well known public domian books. You can use these short documents to calculate average word ranks. Hopefully the childrens books will have lower reading levels than the classic novels.
There are two essential problems you need to solve in order to complete your program. First, you need some way to read and store all 1000 words and their corresponding ranks, and some way to search this data structure to look up a word and find its rank. The most natural way to do this would be to create a "Dictionary" class that has a "read_file" method to load words and their ranks into private arrays, and a "binary_search" method to recursively search these arrays to look up the rank of a word. To test and debug this class, you can write a tiny main program that prompts the user for words, and prints out their corresponding ranks. Hint: See the program "dictionary.cpp" and "numbers2.cpp" in the source directory for some sample code.
The second problem you must address is the reading and processing of the input document. At one level, this is relatively easy because you just need to read an input file one word at a time until the end of file is reached, and look up each word using your Dictionary class above to calculate the average ranks. The tricky part of this process is dealing with upper case letters, numbers, and other characters in the input file. You need to convert A..Z to a..z and remove all other characters except the single quote ' from your words before you look them up in your dictionary. For example, you need to convert "The" to "the" and "cold," to "cold" to process the sample text above.
You can implement this program using either a bottom-up approach or a top-down approach. If you go for a bottom-up approach, start by creating basic methods and classes, and test theses methods using a simple main program that calls each method. When this is working, you can create the main program that uses these methods to solve the problem above.
If you go for a top-down approach, start by creating your main program that reads user input, and calls empty methods to pretend to solve the problem. Then add in the code for these methods one at a time. This way, you will get an idea of how the whole program will work before you dive into the details of implementing each method and class.
Regardless of which technique you choose to use, you should always develop your code incrementally adding code, compiling, debugging, a little bit at a time. This way, you always have a program that "does something" even if it is not complete. Also, remember to keep backup versions of your code, so you always have a version to fall back on if you accidentally delete your code or your changes do not work out.
Test your program to check that it operates correctly for all of the requirements listed above. Also check for the error handling capabilities of the code. You are required to include your testing results in your project report to demonstrate that your program works correctly. To do this use the "script" command to save all program input/output in a "typescript" file, and cut and paste from this ascii file into your program documentation.
When you have completed your C++ program, use the "Programming Report Template" on the class website to document your programming project. This report has separate sections to describe the problem statement, your design decisions, your implementation process, and your testing results. Each of these sections should be 1-2 paragraphs long, so your completed report will be 2-3 pages long once you have included your testing output.
In this class, we will be using electronic project submission to make sure that all students hand their programming projects and labs on time, and to perform automatic analysis of all programs that are submitted. When you have completed the tasks above go to the class web site to "Upload" your documentation (a single docx file), and your C++ program (a single ascii text file).
The dates on your electronic submission will be used to verify that you met the due date above. All late projects will receive reduced credit:
You will receive partial credit for all programs that compile even if they do not meet all program requirements, so handing projects in on time is highly recommended.