GSoC 2015 – Project Proposal for “NUTCH-1936 - GSoC 2015 - Move Nutch to Hadoop 2.X”

Name: Mohit Bagde (Email: <bagde AT usc DOT edu>)

University: University of Southern California

Mentor Name: Lewis John McGibbney

Abstract: The main goal of this project is to port the entire Apache Nutch trunk codebase to the new Hadoop 2.X API [1]. It will involve a complete overhaul of all the main jobs in a Nutch crawl, viz. injecting, fetching, parsing, de-duplication and indexing, from the current Hadoop 1.X codebase to the new Hadoop MRv2 API.


Motivation for the Project

My name is Mohit Bagde. I am a graduate student currently pursuing my Master’s degree in Computer Science at the University of Southern California. I am strongly self-motivated, both out of interest in this field and out of a desire to do valuable, guided work in Data Informatics, Information Retrieval and Database Systems. I am aware that the mentoring organization and Google expect very high standards from their students. On my part, I can assure you of hard work and consistency, and I believe my enthusiasm will enable me to meet those expectations.

In my current semester, I have taken CS572 Information Retrieval and Search Engines under Prof. Chris Mattmann and have worked on Nutch 1.X [1] as part of the first assignment, which involved crawling with Nutch, integrating with Tika, and subsequently developing a Nutch plugin. I have also taken INF 550 under Prof. Seon Kim, where I have written MapReduce programs over HDFS. Both of these subjects meet in the JIRA issue NUTCH-1936 [2], which is about porting Nutch to Hadoop 2.X. I enjoyed working with Nutch, found the entire experience very instructive, and would like to continue to develop and contribute to Nutch in any way possible.

Apache Nutch – About and Workflow

Apache Nutch is a highly scalable and robust web crawler that is also extremely polite, obeying the rules in the robots.txt file of each website it crawls [3]. Nutch, developed by Doug Cutting (who also created Lucene and Hadoop), now has two separate codebases, namely 1.X and 2.X. Although Nutch is written in Java, it makes use of various “plugin” modules that allow developers to implement their own parsers, deduplication algorithms and indexer interfaces.

Apache Hadoop began as an Incubator sub-project [4] derived from Apache Nutch, as Nutch required significant processing power to perform multi-machine web crawling and indexing. This came about in the form of MapReduce tasks and the HDFS filesystem. Nutch runs on a Hadoop cluster and scales well to the order of ~100 machines. However, a user can also run Nutch on a local machine by configuring Hadoop in standalone or pseudo-distributed mode [5], and thus exercise the same MapReduce workflow on a smaller scale without a cluster.
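As a sketch of the pseudo-distributed setup mentioned above, the minimal configuration might look like the following (file locations and property names follow the Hadoop 2.x defaults; the hostname and port are illustrative and should be adjusted for your installation):

```xml
<!-- etc/hadoop/core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: single-node cluster, so no block replication -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- etc/hadoop/mapred-site.xml: run MapReduce jobs on YARN (Hadoop 2.x only) -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

Note that `fs.defaultFS` and `mapreduce.framework.name` are 2.x-era property names; a 1.x setup would instead use `fs.default.name` and has no YARN setting, which is itself a small example of the configuration-level differences this project must account for.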

The primary advantage of Apache Nutch lies in its customizable pluggable interfaces, or “plugins” [6] as they are termed. Below is a simplified view of how Nutch performs its crawling and indexing.

Figure 1 – Apache Nutch Flowchart [7]

Having run a few simple crawls myself, I summarize the overall process as follows:

1. Initially, we create a directory containing one or more files that list the seed URLs (URL files). We then run the inject command to inject the seed URLs into the crawldb, which stores metadata about the crawled URLs. (URL DB)

2. Then the generate command is run to create a segment containing a fetchlist, i.e. the URLs that are due to be fetched; their status is updated in the crawldb once they are fetched successfully. (SEGMENT)

3. The fetch command acquires the content of the URLs on the fetchlist and stores it in the segment directory created by generate. This step takes anywhere from a few minutes (2-3 rounds of crawling) to days (30-40 rounds of crawling), depending on the number of rounds the crawler is run for. Crawls can be tuned by setting parameters in nutch-site.xml, which overrides the nutch-default configuration [8]. (CONTENT)

4. Nutch is integrated with Apache Tika, a general framework of parsers [9], to extract the content and resulting metadata from the URLs that have been fetched. The parse command is then run to parse the content of the fetched pages. It can also be run to update the crawldb during a re-crawl (re-crawling can be done to incorporate changes made to the data of already-crawled URLs). The HTML parser also strips the HTML markup from fetched documents. (PARSED_DATA)

5. Finally, before indexing the parsed data with Apache Solr, Nutch performs link inversion. The rationale is that, for scoring, a page's outgoing links matter less than the links pointing to it; this is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the linkdb, which stores the inbound links known for each URL fetched by Nutch. (INVERTLINK DB)

6. The last step is optional in Nutch as of the 1.10 trunk. One can index the final six directories that are created (as shown in the directory structure) using the solrindex command, which builds on Lucene's library. After installing Solr [10], one can go to localhost:8983/solr/admin (if running on Jetty) and query the data crawled via Nutch.
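As an example of the tuning mentioned in step 3, a minimal nutch-site.xml override might look like the following (the agent name value and thread count here are illustrative, not recommendations):

```xml
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- hypothetical agent name; Nutch refuses to fetch without one -->
    <value>MyTestCrawler</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- raise the number of fetcher threads above the default of 10 -->
    <value>20</value>
  </property>
</configuration>
```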
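The link inversion of step 5 can be illustrated with a small, self-contained Java sketch. This uses plain collections rather than Nutch's actual MapReduce job, and the class name and URLs are made up for illustration; the point is only the inversion of the outlink relation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LinkInverter {

    // Invert an outlink map (page -> pages it links to) into an
    // inlink map (page -> pages that link to it).
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        // Hypothetical URLs for illustration.
        Map<String, List<String>> outlinks = new LinkedHashMap<>();
        outlinks.put("a.com", Arrays.asList("b.com", "c.com"));
        outlinks.put("b.com", Arrays.asList("c.com"));

        // c.com ends up with two inbound links, which is what a
        // PageRank-style scoring function cares about.
        System.out.println(invert(outlinks));
        // prints {b.com=[a.com], c.com=[a.com, b.com]}
    }
}
```

In Nutch itself this inversion runs as a MapReduce job (the map phase emits (target, source) pairs and the reduce phase groups them per target), which is exactly the kind of job the Hadoop 2.X port must cover.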

Apache Hadoop – 1.X vs 2.X Issues

The current version of Apache Hadoop is 2.6; the Nutch trunk, however, still runs against Hadoop 1.X. The main goal of this project is to migrate Nutch from Hadoop 1.X to 2.X. Before discussing how to make these changes in Nutch, we must first understand the key differences between 1.X and 2.X, and what changes must be incorporated to ensure both binary and source compatibility between the two versions for the Nutch trunk.


Hadoop 1.X vs. Hadoop 2.X:

- Number of nodes: ~4,000 nodes per cluster (1.X) vs. ~10,000 nodes per cluster (2.X)
- Running time: O(# nodes in cluster) (1.X) vs. O(cluster size) (2.X)
- Namespace config: only one namespace node (1.X) vs. multiple namespaces for managing HDFS (2.X)
- Application support: only able to run static Map and Reduce jobs (1.X) vs. able to run any Java application that can integrate with Hadoop (2.X)
- Resource management: the JobTracker is a bottleneck, handling both resource management and TaskTracker task scheduling (1.X) vs. YARN (Yet Another Resource Negotiator) for effective cluster management (2.X)

Although this table does not capture all the differences between the two codebases, it is a good starting point for exploring what changes must be made to Apache Nutch's jobs to port them to 2.X. In Apache Hadoop 2.x, the resource-management capabilities have been moved into Apache Hadoop YARN, a general-purpose distributed application-management framework, while Apache Hadoop MapReduce (aka MRv2) remains a pure distributed computation framework.

The crux of the project is therefore to ensure binary and source compatibility for the applications in Nutch that use the old mapred APIs. Binary compatibility means that applications built against the MRv1 mapred APIs can run directly on YARN without recompilation, merely by pointing them at an Apache Hadoop 2.x cluster via configuration. Complete binary compatibility cannot, however, be guaranteed for applications that use the mapreduce APIs, as those APIs have evolved considerably since MRv1. Instead, source compatibility is ensured for the mapreduce APIs that break binary compatibility; in other words, users should recompile applications that use the mapreduce APIs against the MRv2 jars. One notable binary incompatibility is in Counter and CounterGroup.
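To make the porting work concrete, here is a sketch of what moving one hypothetical Nutch-style mapper from the old org.apache.hadoop.mapred API to the new org.apache.hadoop.mapreduce API looks like. The class names are invented for illustration, and the snippet assumes Hadoop 2.x jars on the classpath (plus input/output paths set on the job), so it is illustrative rather than a complete runnable job:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper, ported from the old interface-based
// org.apache.hadoop.mapred.Mapper to the new abstract Mapper class.
public class UrlMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // In MRv1 this logic lived in map(K1, V1, OutputCollector<K2,V2>, Reporter);
    // in MRv2 the OutputCollector and Reporter are folded into Context.
    context.write(new Text(value.toString()), new Text("seen"));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // MRv1 job setup and submission:
    //   JobConf job = new JobConf(conf, UrlMapper.class);
    //   JobClient.runJob(job);
    // MRv2 replacement:
    Job job = Job.getInstance(conf, "url-map");
    job.setJarByClass(UrlMapper.class);
    job.setMapperClass(UrlMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each of the main Nutch jobs listed in the abstract (inject, generate, fetch, parse, dedup, index) would need this kind of mechanical translation, plus attention to the specific incompatibilities (e.g. Counter/CounterGroup) noted above.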

In general, MRv2 ensures satisfactory compatibility with MRv1 applications. However, due to improvements and code refactorings, a few APIs have been rendered backward-incompatible. The first quarter of the project will deal with identifying which of these backward-incompatible APIs are used by Nutch, and discussing the resulting issues as part of the NUTCH-1219 JIRA ticket [12]. I have already begun to look into this and will continue to do so as the project proceeds.

Project Plan

I believe the project can be completed in a total of four quarters, with each quarter building on its predecessor. I propose the following plan for completing the project, with a timeline for each quarter including deadlines and report submissions.






Initial Project Proposal draft deadline

03/28/2015 to 04/15/2015: Work during this phase will mostly be oriented towards studying the Apache Nutch and Hadoop documentation, and I will also start drafting the first report. This phase will be fairly long, as I will have to read, understand and discuss the documentation with my mentor on a regular basis. Issues to be addressed include 2.X binary and source compatibility, YARN configuration, and identification of mapred API issues in the Nutch trunk [13].

04/16/2015 to 04/23/2015: In this short period, I will work on crawling with Nutch and identifying potential incompatibilities when it is deployed over Hadoop. I have already used Nutch as part of my coursework and know how to crawl, inject, fetch, parse and index documents. I also have a Cloudera Distribution of Hadoop (CDH) [14] and HBase, along with other HDFS framework technologies (such as Sqoop and Flume), and can easily test the Map and Reduce tasks currently used in Nutch.

04/24/2015 to 05/01/2015: Study break, as I have final exams on 30th April and 1st May and will need some time to prepare for them.

05/02/2015 to 05/22/2015: I will submit a rough draft of the first report to my mentor and discuss the issues that have been solved and those not yet tackled. Additional reading, resources and related work in this area will be addressed during this time frame. I will also begin work on resolving JIRA issues NUTCH-1936 and NUTCH-1219.

Deadline for the submission of First report

05/24/2015 to 06/10/2015: Coding begins on the issues identified and discussed with my mentor during the first quarter of the project. I will need to keep track of any new bugs and fixes that appear during this phase. A rough draft of the second report is to be completed.

06/11/2015 to 06/20/2015: Deadline for completing the first half of the coding work, and discussion of the second report with my project mentor. NUTCH-1219 is to be resolved by the end of this period.

Submission of Second report incorporating NUTCH-1219 resolution

06/22/2015 to 07/02/2015: The remainder of the coding work and the issues in the second report are to be completed, along with a discussion with my mentor on how to handle the mid-term evaluation.

Mid-term evaluation done by Google

07/04/2015 to 07/23/2015: Based on the evaluation, existing work is to be revised per my mentor's suggestions. The deadline for completing the second half of the coding work also falls in this period.

07/24/2015 to 08/02/2015: Testing is to be carried out on the system during this phase. Existing code must be tested against edge cases, exceptions, and the incompatibility issues addressed in the first and second reports.

08/03/2015 to 08/10/2015: Documentation is to be written for all phases, step by step, and discussed with my mentor. An FAQ section and bug fixes are to be documented as well. Final changes are to be made to improve brevity and readability.

08/11/2015 to 08/16/2015: Proofreading and preparation for the final report submission deadline. Final discussions with my mentor on any loose ends and patches to the report.

08/17/2015 to 08/20/2015: Suggested 'pencils down' date. Take a week to scrub code, write tests and improve documentation.

Firm pencils down date and submission of report
















MohitBagde (last edited 2015-03-29 18:05:04 by MohitBagde)