Title :

 

GSoC 2015 Proposal

Issue :

 

NUTCH-1741 - Support Sitemap Crawler in Nutch 2.x

Student :

Cihad Güzel - cguzelg@gmail.com

 

Mentors :

 

Lewis John McGibbney, Talat Uyarer

Abstract

In the Nutch crawler, URLs can currently be discovered only from pages that have already been crawled. This method is expensive. In addition, the importance and the change frequency of these URLs are not known; they can only be guessed. An up-to-date sitemap file, however, lists all of a website's URLs. For this reason, the sitemap files of a website should be crawled. This project will add sitemap crawler support to Nutch.

Introduction

A sitemap is a file that guides crawlers so that a website can be crawled more effectively. It comes in several file formats (such as plain text, XML, RSS 2.0, and Atom 0.3 & 1.0).

An up-to-date sitemap file lists all of a website's URLs, so the sitemap crawler to be developed will allow websites to be crawled faster. In addition, information such as the change frequency, the last modification time, and the priority of each page can be read from the sitemap. In short, a better URL list can be obtained quickly and easily from the sitemap file. Another advantage is that this process is under the user's control. Finally, when the project is concluded;
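As an illustration of the per-page information mentioned above, here is a minimal sketch in Python (not Nutch code) that reads an XML sitemap and extracts the URL, last modification time, change frequency, and priority of each entry. The element names and the namespace come from the sitemaps.org protocol; the names `SitemapEntry` and `parse_sitemap` and the sample data are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Namespace defined by the sitemaps.org protocol for XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class SitemapEntry:
    """One <url> element of an XML sitemap (hypothetical helper type)."""
    loc: str
    lastmod: Optional[str] = None
    changefreq: Optional[str] = None
    priority: Optional[float] = None


def parse_sitemap(xml_text: str) -> List[SitemapEntry]:
    """Return the entries of an XML sitemap; entries without <loc> are skipped."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = (url.findtext("sm:loc", namespaces=SITEMAP_NS) or "").strip()
        if not loc:
            continue  # <loc> is the only required child element
        priority = url.findtext("sm:priority", namespaces=SITEMAP_NS)
        entries.append(SitemapEntry(
            loc=loc,
            lastmod=url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            changefreq=url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
            priority=float(priority) if priority else None,
        ))
    return entries


# Sample data invented for this example.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2015-03-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>"""

for entry in parse_sitemap(SAMPLE):
    print(entry)
```

The optional fields stay `None` when a page does not declare them, so a crawler could fall back to its own guesses only in those cases.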

Project Details:

The aim is to strengthen the Nutch project with sitemap crawler support. The main goal is to detect sitemaps that contain correct URLs and to crawl them. Finding correct URLs with a sitemap crawler is easy and fast. The software will provide the following features.

  1. Sitemap detection: Sitemap files will be detected automatically, if available.
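One common way to detect sitemaps automatically is to look for `Sitemap:` lines in a site's robots.txt file. The sketch below, in Python, shows that idea only; it is not the detection logic of this project. The function name `detect_sitemaps` is hypothetical, and falling back to `/sitemap.xml` when robots.txt lists nothing is an assumption: the result is only a candidate URL that would still have to be fetched and checked.

```python
from typing import List
from urllib.parse import urljoin


def detect_sitemaps(base_url: str, robots_txt: str) -> List[str]:
    """Collect sitemap URLs declared in the text of a robots.txt file.

    If none are declared, return a single guessed candidate
    (<site root>/sitemap.xml). The guess is an assumption of this sketch.
    """
    found: List[str] = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        value = value.strip()
        # The field name is matched case-insensitively; comment lines
        # ("# ...") never match because their field is not "sitemap".
        if sep and field.strip().lower() == "sitemap" and value:
            if value not in found:
                found.append(value)
    if not found:
        found.append(urljoin(base_url, "/sitemap.xml"))
    return found


# Sample robots.txt content invented for this example.
ROBOTS = """User-agent: *
Disallow: /private
Sitemap: http://example.com/sitemap.xml
sitemap: http://example.com/news.xml
"""

print(detect_sitemaps("http://example.com/", ROBOTS))
```

The function works on the text of robots.txt rather than fetching it, so it can be tried without network access.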

Advantages for the development process of the project:

  1. The new features can be integrated easily thanks to the Nutch plugin design and the Nutch life cycle.

Timeline:

The project development process can be divided into two stages. In the first stage, the Nutch crawler life cycle will be updated for the sitemap crawler, and sitemaps will be crawled in a simple way before the midterm. In the second stage, the remaining work will be completed, such as sitemap detection, the filter & ranking mechanism, documentation, and tests.

  *Pre-GSoC:*  The discussions and comments on NUTCH-1741 \[1\] and NUTCH-1465 \[2\] will be followed.

References:

 * \[1\] https://issues.apache.org/jira/browse/NUTCH-1741
 * \[2\] https://issues.apache.org/jira/browse/NUTCH-1465
 * \[3\] https://issues.apache.org/jira/secure/attachment/12707721/SitemapCrawlerLifeCycle.pdf

Reports

Documentation

Documents will be added here.

Source Code

Source code on GitHub

Jira Issues