CS 553 Homework 1


Overview:

In this homework, you will build a simple plagiarism detector using Google's web service API. Your detector will take a text file as input and output a frequency-sorted list of URLs matching phrases in the text file.

Due date:

Email your submission to the TA (see below) by midnight, Friday, Feb 6th.

Building the Detector:

Go to Google’s Web Service API site.

Download the developer's kit and create an account. Follow the instructions in the developer's kit. Try running the sample client program. If you get this program to work, you're 50% done.
 
Implement a client program "Detector.java" if you use Java or "Detector.pl" if you use Perl.

Your program takes a web-service key as its 1st argument and a text file as its 2nd argument.
E.g., if you use Java, compiling and running your program should go something like this:

javac Detector.java 
java Detector 0000000000000000000000 sonnet.txt
For Perl, your program will work as above, but skip the compilation step.
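
A minimal sketch of the argument handling for the Java version follows. The helper name parseArgs, the usage message, and the choice of UTF-8 for reading the file are assumptions made for illustration; they are not part of the assignment or the developer's kit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Detector {
    // Hypothetical helper: checks that exactly two arguments were given
    // and returns them as { key, fileName }.
    static String[] parseArgs(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException("usage: java Detector <key> <text-file>");
        }
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) throws IOException {
        String[] parsed;
        try {
            parsed = parseArgs(args);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            System.exit(1);
            return; // not reached; lets the compiler see that parsed is assigned below
        }
        String key = parsed[0];
        // Assumption: the input file is UTF-8 text.
        String text = new String(Files.readAllBytes(Paths.get(parsed[1])), StandardCharsets.UTF_8);
        System.out.println("Read " + text.length() + " characters; key has "
                + key.length() + " characters");
    }
}
```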

General Algorithm

Break the file into 10-word phrases. Each "word" is a string separated by whitespace. You can use the Java StringTokenizer class for this.
Search Google for each phrase as a whole (not just the words in the phrase), up to a maximum of 500 phrases (5000 words).
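
The chunking step above could be sketched in Java as follows. This is one possible reading, not the required implementation: it assumes the phrases do not overlap (which matches 500 phrases covering 5000 words), that a final phrase shorter than 10 words is kept, and that wrapping a phrase in double quotes is how you ask for the phrase as a whole. The class and method names (PhraseChunker, chunk, unique, asExactQuery) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

class PhraseChunker {
    static final int WORDS_PER_PHRASE = 10;
    static final int MAX_PHRASES = 500;

    // Splits text into consecutive, non-overlapping phrases of wordsPerPhrase words,
    // stopping once maxPhrases phrases have been produced.
    static List<String> chunk(String text, int wordsPerPhrase, int maxPhrases) {
        StringTokenizer tokens = new StringTokenizer(text); // default delimiters are whitespace
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int wordsInCurrent = 0;
        while (tokens.hasMoreTokens() && phrases.size() < maxPhrases) {
            if (wordsInCurrent > 0) {
                current.append(' ');
            }
            current.append(tokens.nextToken());
            wordsInCurrent++;
            if (wordsInCurrent == wordsPerPhrase) {
                phrases.add(current.toString());
                current.setLength(0);
                wordsInCurrent = 0;
            }
        }
        // Assumption: a trailing phrase with fewer words is still kept.
        if (wordsInCurrent > 0 && phrases.size() < maxPhrases) {
            phrases.add(current.toString());
        }
        return phrases;
    }

    // Removes duplicate phrases while keeping first-seen order.
    static Set<String> unique(List<String> phrases) {
        return new LinkedHashSet<>(phrases);
    }

    // Assumption: quoting the phrase requests a whole-phrase match.
    static String asExactQuery(String phrase) {
        return "\"" + phrase + "\"";
    }
}
```

For example, PhraseChunker.chunk(text, PhraseChunker.WORDS_PER_PHRASE, PhraseChunker.MAX_PHRASES) gives the list of phrases, and the size of unique(...) applied to that list gives a count of unique phrases.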

What to output:

  1. Total number of unique phrases
  2. Either: A list of up to 25 entries of <count, URL> pairs, where each pair is computed as:
     Or: the message “No matches found”
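
One way to produce the frequency-sorted output is sketched below. It assumes that a URL's count is the number of search results, across all phrases, in which that URL appeared; the assignment text does not spell this out here, so treat it as a guess. It also assumes ties are broken alphabetically by URL. The search itself is not shown: urlsPerPhrase stands in for whatever result URLs your Google client returned for each phrase, and the names UrlTally and report are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UrlTally {
    static final int MAX_ENTRIES = 25;

    // urlsPerPhrase holds, for each searched phrase, the URLs returned for it.
    // Returns the output lines: "count URL" sorted by descending count,
    // or the single line "No matches found".
    static List<String> report(List<List<String>> urlsPerPhrase, int maxEntries) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> urls : urlsPerPhrase) {
            for (String url : urls) {
                counts.merge(url, 1, Integer::sum); // assumed meaning of "count"
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Assumption: break ties alphabetically so the output is deterministic.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> lines = new ArrayList<>();
        if (entries.isEmpty()) {
            lines.add("No matches found");
            return lines;
        }
        int limit = Math.min(maxEntries, entries.size());
        for (int i = 0; i < limit; i++) {
            lines.add(entries.get(i).getValue() + " " + entries.get(i).getKey());
        }
        return lines;
    }
}
```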

What to hand in:

Email the source code of your program to fzeng@cs.rutgers.edu. Make sure your name is in the source code.