Support Sitemap Crawler in Nutch 2.x Midterm Report

Title: GSOC 2015 Midterm Report

Reporting Date: 25th June 2015

Issue: NUTCH-1741 - Support Sitemap Crawler in Nutch 2.x

Student: Cihad Güzel - cguzelg@gmail.com

Mentors: Lewis John McGibbney, Talat Uyarer

Development Codebase: Github Repo Url

Introduction

This project aims to develop a crawler for sitemap files. This report summarizes the work completed so far and the plans for the remaining weeks.

Research and development on the sitemap crawler were carried out over the last four weeks. The timeline set out in the proposal has been followed successfully, and the aim is to complete the remaining work within the target time.

Previous Actions

To make sitemap files part of the Nutch life cycle, work was done on its main steps: Injector, Fetcher, Parser and DbUpdater.

Sitemap files can be crawled in two ways:

  1. The URLs to be crawled are listed in a seed file, and each URL in that list is crawled by passing through the Nutch life cycle. If sitemap files should be crawled as well, their paths are also declared in the seed file. Normally, a seed file looks like this:
        http://www.example.com/
        http://www.example2.com/
        http://www.example3.com/
     If "http://www.example.com/" has two sitemap files, they can be declared in the seed file as follows:
        http://www.example.com/ sitemaps: sitemap1.xml sitemap2.xml
        http://www.example2.com/
        http://www.example3.com/
     In this way, the sitemap file paths to be crawled are defined manually in the seed file. When the InjectorJob runs, every path following the "sitemaps:" label is written to the database as if it were a new URL, and these URLs are then crawled by passing through the normal Nutch life cycle.
     The URLs are marked as sitemaps in the database, so during the parse step only the marked URLs are parsed by the "sitemap-parser" plugin.
  2. Besides the first way, sitemap files can be detected automatically. While a URL is fetched, its robots.txt file is checked in order to apply the crawling rules defined there. A robots.txt file may list sitemap files in addition to those rules. At fetch time, the sitemap entries listed in robots.txt are collected and, if any are found, the corresponding URLs are added to the database (see the sketch after this list).
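
The robots.txt-based detection can be illustrated with the crawler-commons library, which Nutch already uses for robots.txt handling. The snippet below is only a minimal sketch under that assumption; the class name, collectSitemaps() and the hard-coded agent name are illustrative and not part of the actual patch.

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    // Minimal sketch: extract the sitemap URLs advertised in a robots.txt file.
    // In the real patch this happens inside the fetcher's robots handling.
    public class SitemapDetectionSketch {

      public static List<String> collectSitemaps(String robotsUrl, String robotsTxt) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
            robotsUrl,                                   // where the robots.txt came from
            robotsTxt.getBytes(StandardCharsets.UTF_8),  // raw robots.txt content
            "text/plain",                                // content type
            "Nutch-Test");                               // agent name(s) to match
        // crawler-commons collects the "Sitemap:" lines it saw while parsing.
        return rules.getSitemaps();
      }

      public static void main(String[] args) {
        String robotsTxt = "User-agent: *\n"
            + "Disallow: /private/\n"
            + "Sitemap: http://www.example.com/sitemap1.xml\n"
            + "Sitemap: http://www.example.com/sitemap2.xml\n";
        for (String sitemap : collectSitemaps("http://www.example.com/robots.txt", robotsTxt)) {
          System.out.println("Detected sitemap: " + sitemap);  // these go into the "stm" column
        }
      }
    }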

A column named “stm” (an abbreviation of “sitemap”) was added to the database to store sitemap files detected during the fetch.

During the DbUpdater step, each URL stored in the “stm” column is added to the database as a new row, and thereby enters the Nutch life cycle.

To finish crawling a sitemap, the sitemap URLs added during inject need a single pass through the life cycle, while the URLs detected from robots.txt need two passes: injected URLs are recorded directly as new rows at inject time and parsed later, whereas detected URLs are first added to the “stm” column, then written as new rows in the DbUpdater step, and finally parsed [3].
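
The two halves of that flow can be sketched as follows. The report adds a dedicated “stm” column to the storage schema; since that schema change is not shown here, the sketch instead stashes the detected URL in the generic metadata map of the Gora-generated WebPage class, and it assumes the Nutch 2.3 API (WebPage.newBuilder(), map-style getMetadata(), CrawlStatus.STATUS_UNFETCHED). It only illustrates the idea and is not the actual implementation.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import org.apache.avro.util.Utf8;
    import org.apache.nutch.crawl.CrawlStatus;
    import org.apache.nutch.storage.WebPage;

    // Illustrative sketch only: carry a detected sitemap URL on the fetched page
    // and expand it into a new row during the DbUpdater step.
    public class StmUpdateSketch {

      private static final Utf8 STM_KEY = new Utf8("stm");

      // Fetch side: remember a sitemap URL found in robots.txt for this page.
      public static void markDetectedSitemap(WebPage page, String sitemapUrl) {
        page.getMetadata().put(STM_KEY,
            ByteBuffer.wrap(sitemapUrl.getBytes(StandardCharsets.UTF_8)));
      }

      // DbUpdater side: create a new unfetched row for the detected sitemap so
      // that it enters the normal Nutch life cycle in the next round.
      public static WebPage newSitemapRow(WebPage fetchedPage) {
        ByteBuffer stm = fetchedPage.getMetadata().get(STM_KEY);
        if (stm == null) {
          return null;                                   // nothing detected for this page
        }
        String sitemapUrl = new String(stm.array(), stm.arrayOffset() + stm.position(),
            stm.remaining(), StandardCharsets.UTF_8);
        WebPage row = WebPage.newBuilder().build();
        row.setStatus((int) CrawlStatus.STATUS_UNFETCHED);
        // In the real job the row would be emitted keyed by sitemapUrl
        // (reversed, as Nutch stores its row keys) so it is fetched next round.
        System.out.println("New row for detected sitemap: " + sitemapUrl);
        return row;
      }
    }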

A parser plugin was written to complete the parsing step. After parsing a sitemap URL, the plugin sends the URLs listed in the sitemap file to the database; this completes the crawl of that sitemap URL. To activate the sitemap plugin, it must be added to “nutch-site.xml” like the other plugins. As a result of this work, multi-sitemap parsing is supported. In addition, only inlinks are kept after sitemap parsing; any outlinks defined in a sitemap file are ignored.
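
The core of that parse step can be pictured with the crawler-commons SiteMapParser. The sketch below only shows how a fetched sitemap could be turned into the list of URLs to write back to the database; the class name and the surrounding plumbing are illustrative assumptions, not the plugin itself. It also shows how a sitemap index (a sitemap that lists other sitemaps) could be handled. The real plugin is activated by adding its id to the plugin.includes property in nutch-site.xml.

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import crawlercommons.sitemaps.AbstractSiteMap;
    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapIndex;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;

    // Sketch of the core of a sitemap parser: turn the raw bytes of a fetched
    // sitemap into the list of URLs that should be written back to the db.
    public class SitemapParseSketch {

      public static List<String> extractUrls(byte[] content, String contentType, String url)
          throws Exception {
        SiteMapParser parser = new SiteMapParser();
        AbstractSiteMap sm = parser.parseSiteMap(contentType, content, new URL(url));

        List<String> outUrls = new ArrayList<String>();
        if (sm.isIndex()) {
          // A sitemap index: report the child sitemaps so they can be crawled in turn.
          for (AbstractSiteMap child : ((SiteMapIndex) sm).getSitemaps()) {
            outUrls.add(child.getUrl().toString());
          }
        } else {
          // A plain sitemap: report every <loc> entry it contains.
          for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
            outUrls.add(u.getUrl().toString());
          }
        }
        return outUrls;
      }
    }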

The work described above was carried out according to the following schedule [4].

  • Week 1 (25 May - 31 May): sitemap list injection
  • Week 2 (1 June - 7 June): sitemap detection
  • Week 3 & 4 (8 June - 21 June): sitemap parser plugin
  • Week 5 (22 June - 28 June): DbUpdater improvements for the sitemap crawler

Future Plans

Basic sitemap crawling is now working, but every step still needs improvement. In the following weeks, improvements will be made that take the specific features of sitemap files into account. As set out in the proposal, the plan for the remaining weeks is as follows [5]:

  • Week 6 & 7 (29 June - 12 July): sitemap ranking mechanism will be developed
  • Week 8 (13 July - 19 July): sitemap blacklist and sitemap error detection
  • Week 9 (20 July - 26 July): frequency mechanism will be developed
  • Week 10 (27 July - 2 August): filter plugins will be updated
  • Week 11 (3 August - 9 August): code review and code cleanup
  • Week 12 & 13 (10 August - 23 August): further refinement of tests and documentation for the whole project

References

  • [1] https://issues.apache.org/jira/browse/NUTCH-1741
  • [2] https://issues.apache.org/jira/browse/NUTCH-1465
  • [3] https://issues.apache.org/jira/secure/attachment/12707721/SitemapCrawlerLifeCycle.pdf
  • [4] https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport
  • [5] https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler
