
This page provides commentary and thoughts on adapting Nutch not only to fetch AJAX/JavaScript driven dynamic HTML content, but also for interacting with that content (potentially a number of times) within a fetching scenario.

Let's Begin with a Scenario

...

  • use Nutch to log in to a database which requires HTTP POST authentication
  • follow the redirect to the database landing query form
  • submit a query to the form which will return a ranked list of search results for the given query
  • interpret the JavaScript for each result in the ranked list
  • use an HtmlParseFilter to obtain high-level article/document content
  • submit a GET request to invoke JavaScript which will return a PDF of the full textual content for this document
  • return the full document (PDF) content and metadata along with the HTML parse filter data

Crawling JavaScript/AJAX sites

To crawl web pages that load content dynamically through JavaScript/AJAX, use the Protocol-Selenium plugin. It loads each page you crawl in a Selenium-driven browser, so that the JavaScript on the page is executed properly.
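As a rough sketch of what enabling the plugin can look like, the fragment below swaps protocol-selenium in for the default HTTP protocol plugin in conf/nutch-site.xml. The rest of the plugin.includes value is an assumption based on a typical Nutch 1.x setup, and selenium.driver is the property name as found in nutch-default.xml for recent 1.x releases; check both against the nutch-default.xml shipped with your version.

```xml
<!-- conf/nutch-site.xml (fragment; goes inside <configuration>) -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: only the protocol entry matters here; keep whatever
       other plugins your crawl already uses. -->
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>selenium.driver</name>
  <!-- Browser that Selenium should drive. -->
  <value>firefox</value>
</property>
```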

If you need to interact with the pages you are crawling (e.g., JavaScript-based pagination, or clicking elements to load content dynamically), use the Protocol-InteractiveSelenium plugin. With this plugin you create Handlers that interact with the pages in a defined way through the Selenium WebDriver interface, which lets you perform any Selenium-based interaction you wish on a per-URL basis.
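A minimal configuration sketch for the interactive variant, assuming the property name interactiveselenium.handlers from nutch-default.xml. DefaultHandler ships with the plugin; MyPaginationHandler is a hypothetical class name standing in for a handler you would write yourself, and the other plugin.includes entries are again an assumption.

```xml
<!-- conf/nutch-site.xml (fragment; goes inside <configuration>) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-interactiveselenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>interactiveselenium.handlers</name>
  <!-- Comma-separated list of handler class names.
       MyPaginationHandler is hypothetical and must exist in the plugin package. -->
  <value>DefaultHandler,MyPaginationHandler</value>
</property>
```

In this setup each listed handler would decide for itself which URLs it applies to, and then act on the loaded page through the WebDriver instance it is given.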

...

FAQs

  • How do I suppress Firefox from popping up during a Selenium crawl?
    1. Assign Firefox a particular space: move Firefox to a dedicated space. Then right-click the Firefox icon in the Dock and go to Options > Assign To > This Desktop.
    2. Add the following key as a child of <dict> in /Applications/Firefox.app/Contents/Info.plist:
      <key>LSBackgroundOnly</key>
      <string>True</string>
    3. Quit Firefox.
    4. Start crawling with Selenium. You will notice that Firefox opens silently in its own assigned space.