Let's Begin with a Scenario
Let's say that, as a Nutch crawl administrator, your client has given you the following task: "Get me domain-specific material from a database such as NTIS." (NTIS, the National Technical Information Service, serves as the largest central resource for government-funded scientific, technical, engineering, and business-related information available today.) In practice, this translates to the following:
- Use Nutch to log in to a database that requires HTTP POST authentication
  - Follow the redirect to the database landing query form
  - Submit a query to the form, which returns a ranked list of search results for that query
- Use an HtmlParseFilter to obtain high-level article/document content
  - Return the full document (PDF) content and metadata along with the HtmlParseFilter data
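To make the first three steps concrete outside of Nutch, here is a minimal Python sketch of a form-based POST login followed by a query submission. It is only an illustration of the HTTP exchange, not how Nutch itself implements it. The URLs and the form field names (username, password, q) are hypothetical placeholders; a real database such as NTIS defines its own.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Hypothetical endpoints; replace with the real login and search URLs.
LOGIN_URL = "https://database.example.org/login"
SEARCH_URL = "https://database.example.org/search"


def build_form_post(url, fields):
    """Build a POST request carrying a URL-encoded form body."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def make_session():
    """Return an opener that keeps cookies between requests.

    urllib's default handlers follow HTTP redirects, so the redirect
    from the login page to the query form is followed automatically.
    """
    return build_opener(HTTPCookieProcessor(CookieJar()))


def login_and_search(opener, username, password, query):
    """Log in, then submit a query; returns the result page as text."""
    # Field names below are assumptions made for this sketch.
    login = build_form_post(LOGIN_URL, {"username": username, "password": password})
    with opener.open(login, timeout=30) as response:
        response.read()  # page reached after the redirect; session cookies are now stored
    search = build_form_post(SEARCH_URL, {"q": query})
    with opener.open(search, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

The cookie jar is what carries the authenticated session from the login request to the query request. Parsing the ranked result list and fetching the PDFs would follow from the HTML returned by login_and_search.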
Related Development Issues
Nutch Selenium Plugin NUTCH-1933
http://grid.selenium.googlecode.com/git-history/24150d2e97090b8b439bcc6a396911fb53200749/src/main/webapp/step_by_step_installation_instructions_for_osx.html - Installation instructions for Selenium Grid 2 on a Mac (needed for the momer/nutch-selenium-grid-plugin).
http://grid.selenium.googlecode.com/git-history/00eae2a86d81c4ef8da355b0a8b916a9095a5cd9/src/main/webapp/download.html - Download page for Selenium Grid (version 1.0.8, the latest at the time of writing).
- Assign Firefox a particular space: move Firefox to a dedicated space, then right-click the Firefox icon in the Dock and go to Options > Assign To > This Desktop.
- Add the following key and value as children of the top-level <dict> in /Applications/Firefox.app/Contents/Info.plist:
<key>LSBackgroundOnly</key>
<string>True</string>
- Quit Firefox.
- Start crawling with Selenium. You will notice that Firefox will open silently in its own assigned space.
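If you would rather not edit Info.plist by hand, the change can be scripted. The following is a minimal sketch using Python's standard plistlib module. The function names are made up for this example, and the value is written as the string True only to mirror the snippet above; it is worth keeping a backup copy of the original file before running anything like this.

```python
import plistlib


def with_background_only(plist_bytes, value="True"):
    """Return plist data with LSBackgroundOnly set on the top-level dict."""
    info = plistlib.loads(plist_bytes)
    # The string value mirrors the hand-edited snippet; this is an assumption
    # carried over from the instructions, not a verified requirement.
    info["LSBackgroundOnly"] = value
    return plistlib.dumps(info)


def patch_info_plist(path):
    """Rewrite the plist file at `path` in place with the key added."""
    with open(path, "rb") as handle:
        original = handle.read()
    with open(path, "wb") as handle:
        handle.write(with_background_only(original))
```

For example, patch_info_plist("/Applications/Firefox.app/Contents/Info.plist") would apply the edit described above, after which you quit Firefox and start the crawl as before.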