CSCE 2014 - Homework 5
Due Date - 04/08/2013 at 11:59 PM

1. Problem Statement:

I am sure you have all used google to search for documents, but have you ever wondered how google works? At a high level their company does the following: 1) visit every web page on the internet, 2) get information about the words in each web page, 3) build a giant index using this word information, 4) process user queries to find the web pages that match the query words, and 5) charge a lot of money for the targeted ads that appear on the results page.

In this assignment, we will be using file processing and sorting to get our own information about the words that appear in a web page. To do this, your program must do the following: 1) read an ASCII file and extract the words, 2) store these words in an array of strings, 3) sort the array of strings in alphabetical order, 4) process the array to count how many times each word occurs, and 5) print out the words and counts for all words that occur more than COUNT times in a web page.

For example, if the document contains the following text from the Dr. Seuss Green Eggs and Ham book:

I would not like them
here or there.
I would not like them
I do not like
green eggs and ham.
I do not like them,

After sorting, you should have the following array of words: "I I I I I Sam am and anywhere do do eggs green ham here like like like like not not not not or them them them there would would". Notice that we have removed all non-alphanumeric characters from the input. If you have COUNT=3, then the program should print out the counts and words for all words that appear three or more times. For the example above, you should print the following:

5 I
4 like
4 not
3 them

To test your program, do a google search for a topic that you are interested in, and save the web page as an html document. Then run your program on this document to see what the "top ten" most common words were, and how many times your query words appeared. Were your query words in the top ten? Try your experiment with a second query, and see if you get similar results. Include your findings in your project report.

2. Design:

Your first design task is to figure out how to read the ASCII file into an array of words. One option is to declare a string variable str and use "din >> str" to read the file. This approach will require some work to remove non-alphanumeric characters from each string. Another option would be to declare a char variable ch, and use "din.get(ch)" to read one character at a time, and build strings up one letter at a time. A third option would be to use din.getline() method to process the input file one line at a time. You are welcome to use any approach you wish.

Your second design task is to sort the array of words. Here you are allowed to use ANY sorting method described in the book, in the lectures, or the labs. You are NOT allowed to use the built-in C++ sort algorithm, the Linux command line sort program, or sorting code you find on the web. One of the goals of this assignment is for you to implement and debug a string sorting function.

Your final design task is to process the sorted array to find the words that occur more than COUNT times. This sounds easy, but it can be tricky to implement correctly. Make sure you test your code on some small test files (like the Green Eggs and Ham text above which only has 50 unique words) to be sure it is 100% correct.

3. Implementation:

You can implement this program using either a bottom-up approach or a top-down approach. If you go for a bottom-up approach, start by creating basic methods and classes, and test theses methods using a simple main program that calls each method. When this is working, you can create the main program that uses these methods to solve the problem above.

If you go for a top-down approach, start by creating your main program that reads user input, and calls empty methods to pretend to solve the problem. Then add in the code for these methods one at a time. This way, you will get an idea of how the whole program will work before you dive into the details of implementing each method and class.

Regardless of which technique you choose to use, you should develop your code incrementally adding code, compiling, debugging, a little bit at a time. This way, you always have a program that "does something" even if it is not complete.

4. Testing:

Test your program to check that it operates correctly for all of the requirements listed above. Also check for the error handling capabilities of the code. Test your program and save your testing output in text files for submission on the program due date.

5. Documentation:

When you have completed your C++ program, write a short report (less than one page long) describing what the objectives were, what you did, and the status of the program. Does it work properly for all test cases? Are there any known problems? Save this report in a separate text file to be submitted electronically.

6. Project Submission:

In this class, we will be using electronic project submission to make sure that all students hand their programming projects and labs on time, and to perform automatic analysis of all programs that are submitted. When you have completed the tasks above go to the class web site to "submit" your documentation, C++ program, and testing files.

The dates on your electronic submission will be used to verify that you met the due date above. All late projects will receive reduced credit (50% off if less than 24 hours late, no credit if more than 24 hours late), so hand in your best effort on the due date.