CS 553 Homework 1
Overview:
In this homework, you will build a simple plagiarism detector using
Google's web service API. Your detector will take a text file as input
and output a frequency-sorted list of URLs matching phrases in the
text file.
Due date:
Email your submission to the TA (see below) by midnight, Friday, Feb 6th.
Building the Detector:
Go to Google’s Web Service API site.
Download the developer's kit and create an account. Follow the
instructions in the developer's kit. Try running the sample client
program. If you get this program to work, you're 50% done.
Implement a client program “Detector.java” if you use Java or
“Detector.pl” if you use Perl.
The program takes a web-service key as the 1st argument and a text file
as the 2nd argument.
E.g., if you use Java, compiling and running your program should go
something like this:
javac Detector.java
java Detector 0000000000000000000000 sonnet.txt
For Perl, your program will work as above, but skip the compilation
step.
General Algorithm:
Break the file into 10-word phrases. Each "word" is a string
separated by whitespace. You can use the Java StringTokenizer class
for this.
Search Google for each phrase as a whole (not just the individual words
in the phrase), up to a maximum of 500 phrases (5000 words).
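The chunking step above can be sketched roughly as follows. This is only an illustration, not a required design: the class and method names (PhraseChunker, chunk, uniquePhraseCount, asPhraseQuery) are made up for this example, and it assumes that a final phrase shorter than 10 words is kept rather than dropped, which the assignment does not specify. Wrapping a phrase in double quotes is the usual way to ask Google for the exact phrase rather than the individual words.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for illustration; names are not part of the assignment.
class PhraseChunker {
    static final int WORDS_PER_PHRASE = 10;
    static final int MAX_PHRASES = 500;

    // Splits text into consecutive, non-overlapping phrases of up to 10 words.
    static List<String> chunk(String text) {
        List<String> phrases = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer tokens = new StringTokenizer(text);
        StringBuilder current = new StringBuilder();
        int wordsInCurrent = 0;
        while (tokens.hasMoreTokens() && phrases.size() < MAX_PHRASES) {
            if (wordsInCurrent > 0) {
                current.append(' ');
            }
            current.append(tokens.nextToken());
            wordsInCurrent++;
            if (wordsInCurrent == WORDS_PER_PHRASE) {
                phrases.add(current.toString());
                current.setLength(0);
                wordsInCurrent = 0;
            }
        }
        // Assumption: a trailing phrase with fewer than 10 words is kept.
        if (wordsInCurrent > 0 && phrases.size() < MAX_PHRASES) {
            phrases.add(current.toString());
        }
        return phrases;
    }

    // Number of distinct phrases, for the "unique phrases" total.
    static int uniquePhraseCount(List<String> phrases) {
        return new HashSet<>(phrases).size();
    }

    // Quotes the phrase so the search matches it as a whole.
    static String asPhraseQuery(String phrase) {
        return "\"" + phrase + "\"";
    }
}
```

Each string returned by chunk would then be passed through asPhraseQuery before being sent to the search service.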
What to output:
- Total number of unique phrases
- Either: a list of up to 25 <count, URL> pairs,
where each pair is computed as:
- URLs matching any of the phrases, sorted by frequency in descending
order, out of the top 10 URLs per phrase as returned by Google.
E.g., if URL 1 matched phrases 1, 4, 5, 6, 100, and 101, this URL would
get a count of 6.
- Or: the message “No matches found”
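The counting and sorting described above could look something like the sketch below. It takes the search results as plain lists of URL strings (one list per phrase, already limited to the top 10), so it does not depend on any particular client library. The names UrlRanker, rank, and format are invented for this example. Two choices here are assumptions the assignment does not state: a URL is counted at most once per phrase, and ties in count are broken alphabetically by URL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; names are not part of the assignment.
class UrlRanker {
    static final int MAX_ENTRIES = 25;

    // urlsPerPhrase holds, for each phrase, the top URLs returned for it.
    static List<Map.Entry<String, Integer>> rank(List<List<String>> urlsPerPhrase) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> urls : urlsPerPhrase) {
            // Assumption: a URL counts once per phrase even if listed twice.
            for (String url : new HashSet<>(urls)) {
                counts.merge(url, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken by URL (an assumption).
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.min(MAX_ENTRIES, entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    // One "count URL" line per entry, or the fallback message.
    static String format(List<Map.Entry<String, Integer>> ranked) {
        if (ranked.isEmpty()) {
            return "No matches found";
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : ranked) {
            out.append(e.getValue()).append(' ').append(e.getKey()).append('\n');
        }
        return out.toString();
    }
}
```

With the example from the text, a URL appearing in the result lists of six different phrases would come out of rank with a count of 6.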
What to hand in:
Email the source code for your program to fzeng@cs.rutgers.edu. Make
sure your name is in the source code.