We had a "Web Crawler Developer" MeetUp at this year's ApacheCon US in Oakland.

It wound up being an UnMeetUp (MeetDown?) on Wednesday, November 4th, from 11am to 1pm.


Attendees

  • Andrzej Bialecki - Apache Nutch
  • Thorsten Scherler - Apache Droids
  • Michael Stack - Formerly with Heritrix, now HBase
  • Ken Krugler - Bixo

Topics

Roadmaps

  • Nutch - become more component-based.
  • Droids - get more people involved.

Sharable Components

  • robots.txt parsing
  • URL normalization
  • URL filtering
  • Page cleansing
    • General purpose
    • Specialized
  • Sub-page parsing (portlets)
  • AJAX-ish page interactions
  • Document parsing (via Tika)
  • HttpClient (configuration)
  • Text similarity
  • Mime/charset/language detection

Tika

  • Needs help to become really usable
  • Would benefit from large test corpus
  • Could do comparison with Nutch parser
  • Needs option for direct DOM querying (screen scraping tasks)
  • Handles mime & charset detection now (some issues)
  • Could be extended to include language detection (wrap other impl)
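
As a rough illustration of the parsing and mime/charset detection points above, here is a minimal sketch using Tika's AutoDetectParser; exact class and method details may vary between Tika releases.

    // Minimal sketch: parse fetched content with Tika and read back the detected type.
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaParseSketch {

        public static String parse(InputStream content, String url) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, url);   // hint for mime detection
            BodyContentHandler handler = new BodyContentHandler();
            parser.parse(content, handler, metadata);
            // Detected mime type (and charset, for text types) ends up in the metadata
            System.out.println(metadata.get(Metadata.CONTENT_TYPE));
            return handler.toString();                       // extracted plain text
        }
    }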

URL Normalization

  • Includes the domain (www.x.com == x.com), path, and query portions of the URL
  • Often site-specific rules
    • Option to derive rules automatically from sets of URLs that point to similar documents.
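
For the generic (non-site-specific) part, a normalizer might look something like the sketch below: lowercase the host, drop a leading "www.", remove the default port, and sort the query parameters. The www rule and the parameter sorting are assumptions here - they aren't safe for every site.

    // Hypothetical generic normalizer: assumes an absolute http URL.
    import java.net.URI;
    import java.util.Arrays;
    import java.util.Locale;

    public class SimpleUrlNormalizer {

        public String normalize(String url) throws Exception {
            URI uri = new URI(url).normalize();                 // resolve "." / ".." path segments
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            if (host.startsWith("www.")) {
                host = host.substring(4);                       // treat www.x.com == x.com
            }
            int port = uri.getPort();
            if (port == 80 && "http".equals(uri.getScheme())) {
                port = -1;                                      // drop the default port
            }
            String query = uri.getQuery();
            if (query != null) {
                String[] params = query.split("&");
                Arrays.sort(params);                            // canonical parameter order
                query = String.join("&", params);
            }
            return new URI(uri.getScheme(), null, host, port, uri.getPath(), query, null).toString();
        }
    }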

AJAX-ish Page Interaction

  • Not applicable for broad/general crawling
  • Can be very important for specific web sites
  • Use Selenium or headless Mozilla
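
A minimal sketch of the Selenium route, assuming the WebDriver API with the JavaScript-capable HtmlUnit driver (headless Mozilla would be wired up differently):

    // Sketch: fetch a page through a JavaScript-capable driver so that
    // AJAX-populated content is in the DOM before extraction.
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;

    public class AjaxFetchSketch {

        public static String fetchRenderedHtml(String url) {
            WebDriver driver = new HtmlUnitDriver(true);        // true = enable JavaScript
            try {
                driver.get(url);
                return driver.getPageSource();                  // DOM after scripts have run
            } finally {
                driver.quit();
            }
        }
    }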

Component API Issues

  • Want to avoid using an API that's tied too closely to any implementation.
  • One option is to have a simple (e.g. URL param) API that takes meta-data.
    • Similar to Tika's passing in of meta-data.
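
A hypothetical illustration of that style - each component takes the URL plus a plain metadata map, so nothing in the signature ties it to Nutch, Droids, or Bixo internals:

    // Hypothetical component signatures: the URL plus a plain metadata map,
    // nothing tied to any particular crawler's internal data structures.
    import java.util.Map;

    /** Keep or drop a URL; per-site hints arrive via the metadata map. */
    interface UrlFilter {
        boolean accept(String url, Map<String, String> metadata);
    }

    /** Return the canonical form of a URL, again driven only by the metadata. */
    interface UrlNormalizer {
        String normalize(String url, Map<String, String> metadata);
    }

Per-site normalization rules, crawl-delay hints, and similar settings would ride along in the metadata map rather than in component-specific configuration objects.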

Hosting Options

  • As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
  • As part of Droids - but Droids is both a framework (queue-based) and set of components.
  • New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
  • Google Code - seems like a good short-term solution, to judge the level of interest and help shake out issues.

Next Steps

  • Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
  • Get input from Thorsten on the Google Code option. If it's OK as a starting point, then Andrzej to set it up.
  • Make a decision about the build system (and then move on to the code formatting debate, big grin).
    • I'm going to propose Ant plus the Maven Ant Tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
  • Start contributing code
    • Ken will put in the robots.txt parser.
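
To give a feel for the component (this is not Ken's actual parser, just a hypothetical minimal version), a robots.txt check boils down to collecting the Disallow rules that apply to the crawler's user-agent and testing request paths against them:

    // Hypothetical, minimal robots.txt check: collects Disallow rules for a given
    // user-agent (or "*") and tests a path against them. Real implementations also
    // handle Allow, Crawl-delay, wildcards, multi-agent groups, and sitemaps.
    import java.util.ArrayList;
    import java.util.List;

    public class MiniRobotRules {
        private final List<String> disallowed = new ArrayList<>();

        public MiniRobotRules(String robotsTxt, String agent) {
            boolean applies = false;
            for (String line : robotsTxt.split("\n")) {
                String trimmed = line.trim();
                int hash = trimmed.indexOf('#');
                if (hash >= 0) trimmed = trimmed.substring(0, hash).trim();  // strip comments
                if (trimmed.isEmpty()) continue;
                String lower = trimmed.toLowerCase();
                if (lower.startsWith("user-agent:")) {
                    String name = trimmed.substring("user-agent:".length()).trim().toLowerCase();
                    applies = name.equals("*") || agent.toLowerCase().contains(name);
                } else if (applies && lower.startsWith("disallow:")) {
                    String path = trimmed.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) disallowed.add(path);               // empty Disallow = allow all
                }
            }
        }

        public boolean isAllowed(String path) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) return false;
            }
            return true;
        }
    }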

Original Discussion Topic List

Below are some potential topics for discussion - feel free to add/comment.

  • Potential synergies between crawler projects - e.g. sharing robots.txt processing code.
  • How to avoid end-user abuse - webmasters sometimes block crawlers because users configure them to be impolite.
  • Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly.
  • robots.txt processing - current problems with existing implementations
  • Avoiding crawler traps - link farms, honeypots, etc.
  • Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
  • Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?)
  • Testing challenges - is it possible to unit test a crawler?
  • Fuzzy classification - mime-type, charset, language.
  • The future of Nutch, Droids, Heritrix, Bixo, etc.
  • Optimizing for types of crawling - intranet, focused, whole web.