GSoC 2017 Proposal
Omkar Reddy - omkarr [at] apache dot org
Lewis John McGibbney
Currently Apache Nutch has the concept of a WebGraph that builds Web graphs, performs a stable convergent link-analysis, and updates the crawldb with those scores. The main purpose of building a new Graph Generator tool for Nutch is to create a substantiated ‘deep’ graph enabling true traversal, this could be a game changer for how Nutch Crawl data is interpreted. This will involve storage of the crawl data as RDF datasets in the form of serialized n-quad statements. This graph can be used to execute queries on the webpages. Graph generation will be achieved using the Apache Tinkerpop ScriptInputFormat and ScriptOutputFormat’s respectively. There are basically two scenarios to represent the graph as RDF datasets that we discuss in this proposal below.
Apache Nutch already has a variety of tools and plugins to crawl Web pages and then interpret the crawl data. The core Nutch data structures are CrawlDB, LinkDB and Segments. These data structures contain all the properties and metadata of every crawl record (CrawlDatum). Nutch also has Webgraph support, which although it does not build a graph permitting query and traversal, help to calculate ranks of links (class <LinkRank>). Our main aim in this project is to build a Webgraph which provides the ability to undertake full graph traversal and which has the ability to be queried.
Description of the project :
Apache Nutch is a highly extensible and scalable open source web crawler software project. Nutch provides different tools for web crawling which rely on Hadoop data structures. We are going to focus more on Nutch 1.x in this project. Nutch also provides plugins which are central to how Nutch works. All of the parsing, indexing and searching that Nutch does is actually accomplished by various plugins.
Apache Tinkerpop is a graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP). Apache Hadoop falls under OLAP system and as does nutch. When a data system is TinkerPop-enabled, its users are able to model their domain as a graph and analyze that graph using the Gremlin graph traversal language.
Gremlin is the graph traversal language of Apache TinkerPop. Gremlin is a functional, data-flow language that enables users to express complex traversals on (or queries of) their application's property graph.
In this project we are going to create a tool using TinkerPop’s ScriptInputFormat and ScriptOutputFormats that use the existing Linkdb, Crawldb and also Segments to create vector objects which act as vertices to the graph. Linkdb, Crawldb and Segments are the data structures that Nutch uses to store data and use them for further crawls. The aim is to create a graph out of the crawl data that can be traversed enabling real time analytics of web pages. We are going to use Name Graphs for storing the crawl data as web graph. Name graphs are a key concept of semantic web architecture in which a set of Resource Description Framework statements (basically a graph) are identified using a URI(Uniform Resource Identifier). We will be using the N-Quads serialization format of the RDF to exploit the context element, which could also specify to which graph the quad statement belonged to.
The following two scenarios are under consideration in the scope of this project.
Scenario A : Vertex represented by a single n-quad statement
Storing the named graph as serialized n-quad statements of the form
Subject, predicate, object and context are separated with a whitespace, with the advantage that the <vertex_metadata> part will be a serialized structured WebPage record, giving the quad record-specific scope compared to arbitrary local statement names. Here since the objects are stored as condensed serialized n-quad statements without any global scope, for query processing there would be an overhead of de-serializing the quad statements and accessing the crawl properties.
Scenario B : Vertex represented by several n-quad statements
- Storing the named graph as serialized n-quad statements of the form
Subject, predicate, object and context (quad constituents) are separated with a whitespace, with the advantage that the <graphname> part will be a URI, giving the quad Web-global scope compared to arbitrary local statement names. This can be considered as more spread out version of scenario A and prevents the overhead of de-serializing as seen in scenario A.
Example : A vertex is represented as serialized n-quad statements with all the properties related to the crawl.
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
The graph developed by the GraphGenerator may also be used to visualize the crawled data as a web graph, which would completely change the perspective with which we look at crawl data in Nutch.
This project would also include upgradation of Apache Nutch code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce, this is proposed because Hadoop claims that the new API would allow programmers write MapReduce jobs in a more convenient, easier and sophisticated fashion. Additonally, Tinkerpop’s ScriptInputFormat and ScriptOutputFormat require Hadoop2 compatability.
Workflow of the project :
The project will follow the corresponding flow of development : - Investigate all the tools that nutch supports.
- Investigate Apache Hadoop. (JobConf and mapreduce APIs)
- Investigate how gremlin is used in TinkerPop for graph traversal.
- Development Phase.
- Writing unit and integration tests parallely with the code being developed.
- Documenting everything during the course of the development.
All the investigations mentioned above will be done in community bonding period.
- Upgrading Nutch master from using org.apache.hadoop.mapred to org.apache.hadoop.mapreduce.
- A well tested plugin/tool to Nutch.
- Test suite consisting of all the test cases.
- Documentation of the whole library.
Community Bonding Period : (May 4 - May 30)
During and before the Community bonding period, a lot of time would be invested in gaining in-depth knowledge about the project and also solving some issues in order to get going really well with the project. Since I have been working with the organisation for a while, I am well aware of all the norms and conventions followed by the organisation. All the investigations mentioned in the workflow will be done during this period.
Coding Period I : (May 31 - June 26)
The coding period will kick off by adding Tinkerpop to Nutch. I will be creating a new data structure say “graphdb” using crawldb, linkdb and segments, that would store the graph generated by tinkerpop. The “graphdb” would also act as one of the command line tools that nutch has. I will be adding this to bin/nutch script. The Graphdb would have several classes under it and those would be decided as the coding period progresses keeping in mind the modularity in Java. The Graphdb will be utilising the scriptInputFormat and ScriptOutput formats of tinkerpop to create a gremlin script with the help of MapReduce Jobs. This script will be used to generate the graph. The graph generated would be stored in graphdb.
Evaluations Phase I : (June 26 - June 30)
The code that will be ready by Coding Phase I, would be tested parallely and all the documentation would be wrapped up during this period. The main aim is to submit a stable bug free code instead of rushing towards the end of the project.
Coding Period II : (July 1 - July 24)
The work mentioned in Coding Period I would be done in parts divided amongst I and II coding periods. All the work lagging in Coding Period I will be taken care of, all the bugs faced would be addressed with the help of the mentor. The end of this period would act as a deadline to the development of the graph generator tool.
Evaluations Phase II : ( July 24 - July 28)
The code that will be ready by Coding Phase II, would be tested parallely, all the documentation would be wrapped up during this period. The main aim is to submit a stable bug free code instead of rushing towards the end of the project.
Coding Period III : (July 29 - August 21)
The master branch of nutch would be upgraded from using org.apache.hadoop.mapred to org.apache.hadoop.mapreduce during this period. These weeks are also scheduled to cope up with the variance in the estimation of the milestones in the roadmap. Some phases might take less time than expected and some phases can take more. Moreover solving the bugs in a definite amount of time cannot be guaranteed. So these weeks help to overcome them.
Final Evaluation : (August 21 - August 29)
All the documentation and code will be wrapped up during this period. All the milestones covered would be clearly mentioned. All the future improvements will also be clearly mentioned. If there is any lag in coping up with the project milestones, detailed steps that would be taken by me to complete the project post GSoC period would be clearly mentioned.
NOTE : This is just an estimate of the project duration. The roadmap may experience some changes accordingly as the project progress and this will be with the consent of my mentor.
Results for the Apache community :
The tool developed as a part of this project will act as a graph generator for Apache Nutch. This graph generator can be used to implement query execution using SPARQL. This could also act as a further improvement. SPARQL is an RDF query language, that is, a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. The web graph generator could also completely change the way we look at nutch crawl data.
Community engagement :
I have been exploring Apache Nutch from the past couple of months. My contributions were very few, since I have spent most of the time in learning more about the project. I have submitted patches (like NUTCH-2366, Documentation), improvements (like NUTCH-2361) and documentation changes. During the course of my contribution I have learnt many things like the way the version control system is used, using the issue tracker, committer coding norms/conventions to be followed for maintaining a well-organised code.
Please find the weekly reports here.