This page provides commentary on adapting Nutch not only to fetch AJAX/JavaScript-driven dynamic HTML content, but also to interact with that content (potentially a number of times) within a fetching scenario.

Let's Begin with a Scenario

Let's say that, as a Nutch crawl administrator, your client has tasked you with the following: "Get me domain-specific material from a database such as NTIS." (NTIS, the National Technical Information Service, serves as the largest central resource for government-funded scientific, technical, engineering, and business-related information available today.) What this really translates to is the following:

Crawling JavaScript/AJAX sites

In order to crawl webpages that rely on JavaScript/AJAX to dynamically load content, you will want to use the Protocol-Selenium plugin. This plugin loads each page that you're crawling in Selenium, so that its JavaScript runs before the page content is captured.
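
As a rough illustration of what enabling the plugin might look like, the fragment below sketches overrides for conf/nutch-site.xml. It assumes the usual approach of listing protocol-selenium in plugin.includes in place of protocol-http. The remaining plugin list is only illustrative, and the selenium.driver property name and its firefox value are assumptions that should be checked against the nutch-default.xml shipped with your Nutch version.

```xml
<!-- Sketch only: overrides for conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-selenium takes the place of protocol-http; the rest of this list is illustrative -->
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- Assumed property name and value; verify in nutch-default.xml -->
  <name>selenium.driver</name>
  <value>firefox</value>
</property>
```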

If you need to interact with the pages that you're crawling (e.g., JavaScript-based pagination, clicking elements to dynamically load content), you will want to use the Protocol-InteractiveSelenium plugin. With this plugin you create Handlers that interact with the pages in a defined way using the Selenium WebDriver interface, which lets you perform any Selenium-based interaction you wish on a per-URL basis.
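
The per-URL handler idea can be sketched in plain Java. This is not the plugin's actual API: the real handlers work against a Selenium WebDriver, whereas PageSession, PageHandler, PaginationHandler, FakeSession, and HandlerSketch below are all hypothetical stand-ins, chosen so the sketch runs without Selenium or Nutch on the classpath. It shows a handler that only fires for matching URLs and clicks a "next" control repeatedly, collecting the content after each click.

```java
import java.util.List;

// Hypothetical stand-in for a browser session (the real plugin passes a Selenium WebDriver).
interface PageSession {
    String currentUrl();
    boolean click(String selector); // false when there is nothing left to click
    String pageSource();
}

// Hypothetical handler contract: decide per URL, then interact with the page.
interface PageHandler {
    boolean shouldProcessUrl(String url);
    String process(PageSession session);
}

// Clicks a "next page" control up to maxClicks times, collecting the source after each click.
final class PaginationHandler implements PageHandler {
    private final String urlPrefix;
    private final String nextSelector;
    private final int maxClicks;

    PaginationHandler(String urlPrefix, String nextSelector, int maxClicks) {
        this.urlPrefix = urlPrefix;
        this.nextSelector = nextSelector;
        this.maxClicks = maxClicks;
    }

    @Override
    public boolean shouldProcessUrl(String url) {
        return url.startsWith(urlPrefix);
    }

    @Override
    public String process(PageSession session) {
        StringBuilder collected = new StringBuilder(session.pageSource());
        for (int i = 0; i < maxClicks && session.click(nextSelector); i++) {
            collected.append(session.pageSource());
        }
        return collected.toString();
    }
}

// In-memory fake session: each successful click advances to the next canned page.
final class FakeSession implements PageSession {
    private final String url;
    private final List<String> pages;
    private int index = 0;

    FakeSession(String url, List<String> pages) {
        this.url = url;
        this.pages = pages;
    }

    @Override
    public String currentUrl() { return url; }

    @Override
    public boolean click(String selector) {
        if (index + 1 >= pages.size()) {
            return false;
        }
        index++;
        return true;
    }

    @Override
    public String pageSource() { return pages.get(index); }
}

final class HandlerSketch {
    // Runs the first handler that accepts the URL; otherwise returns the page untouched.
    static String fetch(List<PageHandler> handlers, PageSession session) {
        for (PageHandler handler : handlers) {
            if (handler.shouldProcessUrl(session.currentUrl())) {
                return handler.process(session);
            }
        }
        return session.pageSource();
    }

    public static void main(String[] args) {
        List<PageHandler> handlers =
            List.of(new PaginationHandler("https://example.org/search", "a.next", 5));
        PageSession session = new FakeSession(
            "https://example.org/search?q=x", List.of("<p>1</p>", "<p>2</p>", "<p>3</p>"));
        System.out.println(fetch(handlers, session));
    }
}
```

Keeping the URL check separate from the interaction means pages that need no interaction pass through untouched, while each kind of dynamic page can get its own handler.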


AdvancedAjaxInteraction (last edited 2016-04-13 05:38:05 by ChrisMattmann)