Implementing a tool for plagiarism detection


Marija Sljivovic

Student e-mail:

maka82 AT gmail DOT com 

Student Major:

Computer Science; Mathematics

Student Degree:


Student Graduation:

Winter 2009


Apache RAT


This document is about the Apache RAT cut&paste detector: http://wiki.apache.org/general/SummerOfCode2009#rat-project
It explains some ideas discussed on the rat-dev mailing list.


My aim is to create a working, modular, configurable command-line tool that searches web-based code search engines for possibly plagiarized code in our code bases.

The tool will be heuristic in nature. It will make guesses about parts of the code. If it decides that a fragment looks good-to-be-copy&pasted, it will check whether matching code exists on code search engines. Any match found will be copied to a report. A person reading this report will then decide whether the code was really copied or not.

Command line parsing

We decided this tool has to be a command-line tool.
All input parameters will be passed to the program as command-line arguments
or provided in a configuration (XML or plain-text) file.
This config file can be loaded automatically (as Ant does)
or by specifying it as a command-line argument.
For now I am thinking about these parameters:

search engines on which the source code will be checked. For this I think the argument can be the class name of an ISearchEngine implementation, as I explained previously.

I think five days at the beginning of the coding period are enough (I have never coded a parser for command-line arguments before, but I think I can use RAT's).

Design of the plagiarism module interface

Sliding window module

I have already coded a basic example of this algorithm. Let's talk about two versions of it:

It will work like the previous one but with different input (a source file from which we have removed all getters, setters and other code chunks we are not interested in), and it will use a list of different "good-to-be-copied-code" recognition methods. These methods are not part of the algorithm itself and will probably be separated by an interface. These heuristic methods will be coded separately from this task and will get more time than the rest of the code. Because of that, I plan to write only several of them at the beginning and the rest at the end of the timeframe. The main difference between the two algorithms is that the second one will rarely call search engines. For writing these two algorithms and implementing a few basic heuristic methods (for one language only) I think ten days are enough.
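As a minimal sketch of the sliding-window idea (the class and method names here are illustrative, not the actual apache-rat-pd code): slide a fixed-size window over the token stream, and treat each window as a candidate chunk that a heuristic checker could mark for searching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of a sliding window over source tokens: each window is a
// candidate chunk that a heuristic checker could mark as "good to be copied".
public class SlidingWindowSketch {

    // Split source into whitespace-separated tokens and slide a window of
    // `size` tokens over them, stepping one token at a time.
    public static List<String> windows(String source, int size) {
        String[] tokens = source.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + size <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String w : windows("int a = b + c ;", 3)) {
            System.out.println(w);
        }
    }
}
```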


These algorithms will call ISearchEngine (or several of them). I found that some code search engines cannot return correct results if we send them a large number of tokens. This is quite unexplored land for me... Writing these parsers will, I hope, take me another ten days.
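The proposal names an ISearchEngine interface without showing it; the sketch below is a guess at its shape (the real apache-rat-pd interface may differ), with a stub engine standing in for a concrete parser.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the ISearchEngine interface mentioned in the
// proposal; the real apache-rat-pd interface may differ.
interface ISearchEngine {
    // Return snippets that match the given code fragment, or an empty list.
    List<String> search(String codeFragment);
}

// A stub engine standing in for a real parser (Google Code Search, Koders...).
public class StubSearchEngine implements ISearchEngine {
    @Override
    public List<String> search(String codeFragment) {
        // A real implementation would query the engine over HTTP here.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        ISearchEngine engine = new StubSearchEngine();
        System.out.println(engine.search("int a = 0;").size());
    }
}
```

Hiding each engine behind this interface is what lets the sliding-window algorithm stay ignorant of which engine (or how many) it is querying.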

After this, the first version will be completed. Then I plan to look back and see whether the design of the application is good or whether I must change something.


This project has to be well-tested. Because of that I will write tests for every heuristic method. It is good practice to write tests after each piece of functionality is finished (or even before it :) ). I plan to use JUnit. Test sources will be loaded from files that exist in the code search engines and from files which I have manually checked to be unique. I will probably have test source code for the whole algorithm and for each of the heuristic methods. When (and in some cases before) the tool passes all my tests, I will do real testing against the search engines. Then I will do the same thing: use non-plagiarized code I wrote and code I got from source repositories which are indexed by the search engines, and run these tests for real. Writing these tests even for a small number of heuristic methods will take a huge amount of time; I think it will add 10-15 days. Because of my poor Internet connection, I plan to set up my own search engine on localhost and try the tool on it (I think I will use Struts2 and Apache Lucene, if it is easy to use...).


This is an approximate time-line for this project:

Some of the tasks can be merged with testing. I hope there will be no need for code refactoring. If this estimate is correct, I hope to have a working tool in the first half of July.


Here you can see a very basic prototype of the tool which I plan to make.

This simple demo tool checks whether source code has been cut&pasted from somewhere by searching
Google Code Search, Koders code search and so on.
The tool is still unfinished, with very basic usability.


Additional Information:

I took part in GSoC 2008 in the project "JSP demo for ICU4J", which I successfully completed. Since then I have developed several tiny tools, and using the knowledge I got in that project I am developing a Russian-Serbian dictionary with two views: a web-based dictionary and a desktop Java application.

Additional ideas:

I found that Google Code Search has an API for client-side code checking:


I found out several things about this API. It has the Apache License, which means we can use it in this project. I downloaded the libraries and made several simple test cases. The API is easy to use and has great performance. What is more important, it has almost all the functionality we want a Google Code Search API to have.

By using it I found that it is very powerful when combined with regular expressions. One of the greatest problems I found earlier with Google Code Search is that long queries are not parsed properly. Using this API with regular expressions I successfully retrieved some multi-word queries. It supports filtering by language, license, file name, package...

Progress reports

BruteforceHeuristicChecker has been added. It will be used if the user wants to check all code; if used, it overrides all other heuristic checkers. JavaCommentHeuristicChecker and PascalCommentHeuristicChecker are finished, together with basic JUnit tests. A class which converts source code into a regular-expression query for Google Code Search has been added, with tests for it too. The tool can now correctly check small parts of code (150 tokens is the maximum). It was noticed that for more than 150 tokens Google Code Search stops responding properly... This will be a problem. It happened when the parser which uses a raw HTTP GET query was used; I hope the problem will be solved when the gdata-codesearch API is used. I will try it. A temporary svn has been created; the freshest version of the source will be there: http://code.google.com/p/apache-rat-pd/
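A minimal sketch of the code-to-regex-query conversion mentioned above (illustrative, not the actual apache-rat-pd class): escape the regex metacharacters in each token (including "$", which a later report notes broke the original stringToRegex) and allow flexible whitespace between tokens.

```java
import java.util.regex.Pattern;

// Illustrative sketch of turning a code chunk into a regex query: escape regex
// metacharacters (note "$", which broke the original stringToRegex) and allow
// any whitespace between tokens. Not the actual apache-rat-pd code.
public class CodeToRegexSketch {

    public static String toRegexQuery(String code) {
        StringBuilder sb = new StringBuilder();
        for (String token : code.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append("\\s+");           // any whitespace between tokens
            }
            sb.append(Pattern.quote(token)); // escapes $, +, (, ) ... safely
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String q = toRegexQuery("total += price;");
        System.out.println(q);
        // The generated query must match the original code again.
        System.out.println("total  +=  price;".matches(q)); // prints "true"
    }
}
```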

The GoogleCodesearch engine has been switched from using an HTML parser to using the gdata-codesearch API. Because the gdata API is not mavenized, adding it to the local Maven repository must be done manually. Script files that do that task have been written and can be found at http://code.google.com/p/apache-rat-pd/ in the download section.

A new class, PdCommandLine, has been created, together with tests for it. The source can be found on svn: http://code.google.com/p/apache-rat-pd/ This class can parse arguments from the command line and store them in properties. If some arguments are not provided, it sets default values for them. The only required property is the codebase. It can parse arguments stored in a configuration file, too. The default configuration file name is config.conf; it is assumed to be in the same directory as apache-rat-pd, and the tool always tries to read it.
Tests are complete, except one for reading the file.

Arguments passed to PdCommandLine are parsed and the values are set on the proper class members. After that the configuration file is parsed: if a path to a configuration file is provided as a command-line argument, we try to parse that file; otherwise the default "config.conf" file is parsed. After these two steps, any member that is still uninitialized is initialized with a default value.
apache-rat-pd now uses this way of parsing command-line arguments.

A list of all available options and usage:

Usage is: java -jar apache-rat-pd.jar [options]


Example of a config file:

-cb c:/src
-e GoogleCodeSearchParser,myParser
-heur BruteForceHeuristicChecker
-limit 75
-r myreport.txt

A new dependency has been added: the project now depends on commons-cli-1.1.jar. pom.xml has been changed, and the Manifest too. New test cases for parsing the configuration file have been added. Validation of arguments is now done using the org.apache.commons.lang.Validate class, and the code is more readable now. More tests for parsing arguments from the configuration file have been written. Support for checking multiple search engines has been added to the core algorithm. Tests for the sliding window algorithm have been updated.

The code has been refactored to be more modular; no new functionality has been added. The sliding window algorithm has been moved to a separate class, SourceCodeAnalyser. It is injected from the main function, so it can easily be replaced with another implementation. Reports are now generated in a separate class, ReportDocument, which is injected too. For reading source files, the same approach as Apache-RAT 0.6 is used. Generally speaking, the code is now more modular, and the dependency-injection approach is used everywhere it can be. The whole project architecture is similar to Apache-RAT's. Tests have been updated.

I think the logic for retrying and waiting does not have to be in the sliding window algorithm. The algorithm gets its information from parser objects, and only the parser objects can know when a retrieval has failed or when it is time to wait. So, for now, this type of logic lives in the parsers.

A RetryManager class has been added to hide the retry and timeout logic. This logic protects the application from exceptions caused by network problems or late responses from the server. Waiting between two queries can avoid triggering a DDoS alarm on the search engine. This part of the code is in a separate class because it will probably be common to all SearchEngine parsers.
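The retry-and-wait idea behind RetryManager can be sketched as follows; the names and signatures are illustrative, not the actual apache-rat-pd class.

```java
import java.util.concurrent.Callable;

// Sketch of the retry-and-wait idea behind RetryManager: retry a failing
// call a few times, sleeping between attempts so the search engine is not
// flooded with queries. Names and signatures are illustrative only.
public class RetrySketch {

    public static <T> T withRetry(Callable<T> call, int maxAttempts, long waitMillis) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                  // remember the failure and retry
            }
            try {
                Thread.sleep(waitMillis); // pause so we do not hammer the engine
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        throw new RuntimeException("all attempts failed", last);
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("network error");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```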

Heuristic checkers for comments in Java, JavaScript, C++ and C# have been written. Since comments in these languages have the same syntax, the same tests used for Java can be used for the others, and because the checking is done in a similar way, they all extend the same class. Code is not considered a valid comment if there are characters before or after the comment.
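A minimal sketch of such a shared C-family comment check (not the actual apache-rat-pd checker): a chunk counts as a comment only if the whole chunk, apart from surrounding whitespace, is a single // or /* */ comment.

```java
import java.util.regex.Pattern;

// Sketch of a comment heuristic checker for the C-family languages (Java,
// JavaScript, C++, C#), which share // and /* */ comment syntax. Per the rule
// above, a chunk counts as a comment only if nothing else surrounds it.
public class CommentCheckerSketch {

    // Either a whole-line // comment or a whole /* ... */ block, with
    // optional whitespace around it and nothing else. DOTALL lets the
    // block form span lines; [^\r\n] keeps the line form on one line.
    private static final Pattern COMMENT = Pattern.compile(
            "\\s*(//[^\\r\\n]*|/\\*.*?\\*/)\\s*", Pattern.DOTALL);

    public static boolean isComment(String chunk) {
        return COMMENT.matcher(chunk).matches();
    }

    public static void main(String[] args) {
        System.out.println(isComment("// a line comment"));        // true
        System.out.println(isComment("/* a block\n   comment */")); // true
        System.out.println(isComment("int x = 1; // trailing"));    // false
    }
}
```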

The package structure has been changed a little.

From the list of the most popular programming languages (http://langpop.com/) I chose the most popular ones to write heuristic checkers for: C, Java, C++, PHP, C#, Perl, Python, Ruby, JavaScript, Delphi, Pascal, SQL, Fortran, Visual Basic, Lisp, Cobol, ActionScript, ColdFusion, Shell, HTML

The plan is to cover these first with heuristic checkers and then eventually add new ones to the list. For now, CommentCheckers have been written for these languages. When matching comments, it is considered better to match a comment even if it is not written the right way (an HTML comment, for example) than to miss a real one (different compiler implementations of the same language can have different ways of commenting). That is the reason I have written the regular expressions to be more general. Where I am not quite sure about the way comments are written in a particular language, I point it out in a //TODO. Tests for all heuristic checkers have been written, too.

The project pom.xml file has been changed so that the build can now package and resolve the dependencies (excluding transitive dependencies) from the repository and place a copy of them in the build location.

New heuristic checkers for recognising functions have been written for these languages: ActionScript, C, C++, C#, Delphi, Java, JavaScript, Pascal, PHP. As with comments, more general regular expressions are used for function matching (a function heuristic checker must recognize a function, but cannot say whether it is correctly written).

A misspelling checker has also been written. It uses an English dictionary to decide whether a word is correctly written. The misspelling checker can decompose combined words and then check the parts; decomposition can be done by CamelCase notation or by the underscore sign. So far I use a simple dictionary class, but the plan is to fetch the dictionary dynamically from the Internet during the build process.
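The decomposition step can be sketched as below (illustrative, not the actual apache-rat-pd decomposer): split on underscores, then on lower-to-upper case boundaries, and lowercase the parts for dictionary lookup.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the word decomposition used by the misspelling checker: split an
// identifier on underscores and CamelCase boundaries so each piece can be
// looked up in a dictionary. Illustrative only, not the apache-rat-pd code.
public class WordDecomposerSketch {

    public static List<String> decompose(String identifier) {
        // First split on underscores, then on lower-to-upper case boundaries.
        String spaced = identifier
                .replace("_", " ")
                .replaceAll("(?<=[a-z0-9])(?=[A-Z])", " ");
        return Arrays.asList(spaced.toLowerCase().trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(decompose("suggestedCodeForChecking"));
        System.out.println(decompose("max_token_count"));
    }
}
```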

Most of the time was spent searching for information about the use of regular expressions in the Google Code Search engine. There are few resources, but I found out some things. I made some simple test cases for querying Google Code Search and found some answers in the generated logs. A very good thing is that we got feedback from the Google Code Search discussion group, where solutions for many of my problems were found. More information about that can be found here: http://groups.google.com/group/Google-Code-Search/browse_thread/thread/2064067f869ddc71?hl=en

The basic idea is that Google Code Search cannot return good enough results if the regular expression is not written to match whole lines. Even then, many results are only partially matched, with improper ordering. Because of that, another algorithm has been written to analyse the degree of similarity between the returned result code and the source code directly in this program.

For now, the Levenshtein distance algorithm is used, together with checking whether results match the proper regular expression. The distance algorithm is adjustable by setting a similarity factor.
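A minimal sketch of this similarity check (illustrative, not the actual apache-rat-pd implementation): the classic Levenshtein distance, normalized into a similarity in [0, 1] that can be compared against a configurable factor.

```java
// Sketch of the similarity check described above: classic Levenshtein
// distance plus a normalized similarity that can be compared against a
// configurable factor. Illustrative, not the apache-rat-pd implementation.
public class LevenshteinSketch {

    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
    public static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));           // 3
        System.out.println(similarity("int a=0;", "int a = 0;") > 0.7); // true
    }
}
```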

For better results more research must be done, but possible improvements can be made if I can process the source code files returned from Google Code Search. Google Code Search itself cannot support this, but it can be done using third-party libraries for HTML parsing like HtmlUnit.

A potential problem is that a query cannot be longer than 1024 characters, but this can be fixed by splitting the code into smaller parts and querying the Google Code Search engine several times. More tests have been written. Old bugs in regex generation have been fixed using knowledge from my research on Google Code Search.
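The splitting fix described above can be sketched like this (illustrative, not the actual parser code): break the query only at token boundaries so no chunk exceeds the limit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fix described above: split a long query into chunks that stay
// under the engine's limit (1024 characters for Google Code Search), breaking
// only at token boundaries. A single token longer than the limit would still
// come through oversized. Illustrative, not the actual parser code.
public class QuerySplitterSketch {

    public static List<String> split(String query, int maxLen) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : query.split("\\s+")) {
            if (current.length() > 0 && current.length() + 1 + token.length() > maxLen) {
                parts.add(current.toString()); // chunk is full, start a new one
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(token);
        }
        if (current.length() > 0) parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        // With a tiny limit of 9 chars, four tokens split into two chunks.
        List<String> parts = split("aaaa bbbb cccc dddd", 9);
        System.out.println(parts);
    }
}
```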

The current progress can now be printed as a percentage.

Heuristic checkers for functions have been written for the following languages: Fortran and Visual Basic. The word decomposer has been improved, and some changes have been made to the previously committed function heuristic checkers: CFunctionHeuristicChecker, CPPFunctionHeuristicChecker, DelphiFunctionHeuristicChecker, JavaFunctionHeuristicChecker and PascalFunctionHeuristicChecker. New tests have been written, and for C and C++ regular expressions for macros and inline functions have been added.

While testing my application a very serious bug was found. I had not paid attention to URL encoding when using the gdata-codesearch API, so when special characters were in the URL, queries gave wrong results. Now that it is fixed, there is no difference between results retrieved from the Google Code Search site and from the gdata-codesearch API. Syntax highlighting of retrieved results works well, too. :)

I have also noticed some bugs when one of the commentHeuristicCheckers is set as the heuristic checker: only the first comment is recognized. To fix this I have added a new class, HeuristicCheckerResult. This class gives information about what was found by a certain heuristic checker and adjusts the operation of the sliding window algorithm so that it can find functions and comments which do not begin at the start of a string.

Also, the concept of suggestedCodeForChecking has been established. It will be assigned by every heuristic checker. This way a heuristic checker can give a better suggestion of which code to check on a search engine. For example, for misspellings it is not necessary to check all the code, only the misspelled words.

I have also noticed a bug where, for some strings, the use of certain regular expressions can hang the application. The problem is caused by this regular expression: (\\s|.*)* I found some information about this problem and will continue researching how to fix regular expressions like this one: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4771934
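The problem with (\s|.*)* is catastrophic backtracking: when such a group is embedded in a larger expression that fails to match, the engine explores exponentially many ways to split the input between \s and .*. Since the group accepts exactly the same strings as a plain dot-all .*, one fix is simply to use the simpler form, which matches in linear time. A small sketch of the equivalence (run on matching inputs only, where both patterns are fast):

```java
import java.util.regex.Pattern;

// Sketch of the backtracking issue mentioned above: (\s|.*)* has
// exponentially many ways to split its input, so a failed match can hang the
// engine. It accepts the same strings as a simple dot-all .*, so the safe
// fix is to use the simpler pattern.
public class BacktrackingSketch {

    public static final Pattern DANGEROUS = Pattern.compile("(\\s|.*)*");
    public static final Pattern SAFE = Pattern.compile("(?s).*");

    public static void main(String[] args) {
        // On matching inputs both succeed quickly and agree; the dangerous
        // pattern only blows up when embedded in a larger failing match.
        String s = "int a = 0;";
        System.out.println(DANGEROUS.matcher(s).matches()); // true
        System.out.println(SAFE.matcher(s).matches());      // true
    }
}
```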

It is now possible to pause the checking process and resume it afterwards just by pressing Enter. That way the user can pause execution for as long as it takes for the search engine to start responding to queries again.

A solution for preventing the endless loop mentioned before has been found. More information about this solution can be found here: http://stackoverflow.com/questions/851057/how-to-prevent-regular-expression-of-hang-or-set-time-out-for-it-in-net/859074

The bug causing the application to never really end has been fixed. Support for generating queries with language information has been added. The language is chosen based on which heuristic checker is used: JavaFunctionHeuristicChecker will cause queries to be generated for the Java language, and so on...

GoogleCodeSearchParser has been improved. It can now split very long queries, so the "query too long" error does not happen any more. Tests have been updated.

I have tested the new version of the application by repeating the tests and comparing the results with previous versions. Many improvements have been made (it better recognises text suspected to be plagiarised, and some regular expressions have been simplified), and the results are much better. I have noticed that the results are hard to read, though, and that something must be done about the format of the report.

Support for defining the language of the source code from the command line has been added. If the language is not defined by a command-line option, it can be derived from the current heuristic checker. A verbose option has been added to the command line; it manages how much information is displayed on the console at run time. Tests are, again, updated.

For some time I have been working on an implementation of a Koders search parser. The aim is to make a working and maintainable version of the Koders [1] engine parser. It must be able to load the koders.com web page, enter a query, parse the retrieved result page and extract useful information from it. Sometimes there can be more than one page of results to check. This code must be easy to maintain and change (if the site changes). The same thing must be done with Krugle code search [2]. Krugle has another option called "advanced search", which can be used for searching large parts of code. With Google Code Search it is easy, because Google has a library to access that service.

After research I found a library which can give us the ability to access these sites: HtmlUnit [3], a "GUI-Less browser for Java programs". It provides an API to access any interesting information on a web page, even one with a lot of JavaScript.

There is no other library which can parse GWT (GoogleCodeSearch) and other JavaScript-heavy pages with this amount of success. HtmlUnit is licensed under the Apache 2.0 license and is already mavenised. The only disadvantages of using this library in our code are its large number of project dependencies and its name; even though it is mainly used for testing, it can be used very well to retrieve information from the web, too. So I believe using this library will help us work with all three parsers in a common way.

I have started discussing the code with one of my mentors.

[1] http://www.koders.com/

[2] http://www.krugle.com/

[3] http://htmlunit.sourceforge.net/

An initial version of the Koders code search parser has been written. With it I can already parse the Koders code search result page and read code from Google Code Search (GWT is supported) which cannot normally be retrieved by the gdata-codesearch API. The gdata-codesearch API does not support retrieving the language of a search result, but with HtmlUnit it is possible. I gave sample code to the community to discuss this approach for working with code search engines other than Google Code Search. On the apache-rat-pd project page there is a sample of using HtmlUnit to parse the Koders code search engine. [4]

I have been working on the reports. We now use a search-result class which contains all data relevant to one search result.

A report is basically a list of search results. This change allows us to make more verbose reports in different formats; so far reports can be exported as HTML, txt or XML files. In addition, it allows us to add more searchEngineParser classes without changes to the core of the project. Selection of the format can be done with the -report command-line option ("-report myReport.html", for example). Support for choosing the proper report format from the command line has been added, too.

Now we can get license, class name, package, project, owner... information from the Google Code Search engine.

Some of the regular expressions in the heuristic checkers have been corrected again. Those regular expressions sometimes caused the application to hang, or a stack overflow exception, because of the backtracking problem. While testing the application I also found a bug: when the "$" sign is in the code, the stringToRegex function breaks. More JUnit tests have been added.

[4] http://code.google.com/p/apache-rat-pd/downloads/list

MarijaSljivovic/SoC2009ApacheRatProposal (last edited 2009-09-20 23:36:15 by localhost)