This page outlines our medium- to long-term goals for Nutch. It is important for both users and developers to understand what we would like to achieve with Nutch over the next few months to a year, and beyond.
Nutch is a community-driven effort, so you are welcome to contribute your views to this page. When adding to the sections below, consider the potential impact on the target audience as well as the development effort needed to reach each goal, and set the goals realistically.
Architecture-related goals
Scalability and performance
Plugins and modularity
The plugin API itself is perhaps not optimal: for each interface (plugin) there is a factory, and each factory contains clutter code that does similar things internally.
Refactor the clutter code out of the factories into a utility class, or improve the plugin system API and move that code into the plugin system itself.
Allow plugins to be packaged as independent .jar files, to ease distribution of plugins and to make it easier to develop plugins outside of Nutch.
Evaluate other strategies for extending Nutch (OSGi, a DI framework, the standard Java extension mechanism, ...)
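As a rough illustration of the "move the clutter into a utility class" idea, here is a minimal sketch. It does not use the real Nutch plugin API: ExtensionCache, UrlFilter, PrefixFilter and UrlFilterFactory are all hypothetical names invented for this example. The point is only that the lookup-and-cache logic lives in one shared helper, so each factory shrinks to a thin wrapper.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class ExtensionLookupSketch {

    /** Hypothetical shared helper: creates each extension once and caches it by (interface, id). */
    static final class ExtensionCache {
        private final Map<String, Object> instances = new ConcurrentHashMap<>();

        <T> T getOrCreate(Class<T> type, String id, Supplier<T> loader) {
            String key = type.getName() + "#" + id;
            // The loader runs only the first time a given key is requested.
            return type.cast(instances.computeIfAbsent(key, k -> loader.get()));
        }
    }

    /** Hypothetical extension interface (not the real Nutch one). */
    interface UrlFilter {
        String filter(String url);
    }

    /** Hypothetical extension: accepts only URLs that start with a given prefix. */
    static final class PrefixFilter implements UrlFilter {
        private final String prefix;

        PrefixFilter(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public String filter(String url) {
            return url.startsWith(prefix) ? url : null;
        }
    }

    /** The factory no longer carries its own caching code; it delegates to the helper. */
    static final class UrlFilterFactory {
        private final ExtensionCache cache;

        UrlFilterFactory(ExtensionCache cache) {
            this.cache = cache;
        }

        UrlFilter get(String prefix) {
            return cache.getOrCreate(UrlFilter.class, prefix, () -> new PrefixFilter(prefix));
        }
    }

    public static void main(String[] args) {
        UrlFilterFactory factory = new UrlFilterFactory(new ExtensionCache());
        UrlFilter first = factory.get("http://");
        UrlFilter second = factory.get("http://");
        System.out.println("same instance reused: " + (first == second));
        System.out.println("accepted: " + first.filter("http://example.org/"));
    }
}
```

Whether such a helper should be a standalone utility or part of the plugin system itself is exactly the open question raised above; the sketch fits either choice.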
Configuration, management and control
Automation scripts for job streams
Internationalization
Page and link databases
Scoring and ranking
Supported content
Fetching
Incremental or checkpointed fetching: the ability to keep already-fetched results even if the fetch job fails partway through.
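One possible shape for checkpointed fetching is sketched below, purely as an illustration. It is not how the Nutch fetcher works: the checkpoint file format (one completed URL per line), the run method and the fetcher function are assumptions made for this example, and a real implementation would also have to persist the fetched content, not just the URL.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class CheckpointedFetchSketch {

    /**
     * Fetches every URL that is not yet recorded in the checkpoint file and
     * returns the URLs fetched during this run. Each URL is appended to the
     * checkpoint right after it succeeds, so a later failure does not lose it.
     */
    static List<String> run(List<String> urls, Path checkpoint, Function<String, String> fetcher) {
        try {
            Set<String> done = new HashSet<>();
            if (Files.exists(checkpoint)) {
                done.addAll(Files.readAllLines(checkpoint, StandardCharsets.UTF_8));
            }
            List<String> fetched = new ArrayList<>();
            for (String url : urls) {
                if (done.contains(url)) {
                    continue; // already fetched by an earlier run
                }
                fetcher.apply(url); // may throw; earlier checkpoint entries survive
                Files.write(checkpoint, (url + "\n").getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                fetched.add(url);
            }
            return fetched;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path checkpoint = Files.createTempFile("fetch-checkpoint", ".txt");
        List<String> urls = List.of("http://a.example/", "http://b.example/", "http://c.example/");
        try {
            // First run: simulate a failure on the second URL.
            run(urls, checkpoint, url -> {
                if (url.contains("b.example")) {
                    throw new IllegalStateException("simulated failure");
                }
                return "content of " + url;
            });
        } catch (IllegalStateException e) {
            System.out.println("first run failed: " + e.getMessage());
        }
        // Second run: the first URL is skipped because it is in the checkpoint.
        System.out.println("second run fetched: " + run(urls, checkpoint, url -> "content of " + url));
        Files.deleteIfExists(checkpoint);
    }
}
```

Appending after each success keeps the sketch simple; a production design would more likely checkpoint in batches to limit I/O.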
Segment management
Indexing
Distributed search - architecture and deployment
Target audience-related goals
Nutch for small-scale users
Nutch for large-scale users
Nutch for system integrators
Nutch for developers
Generally speaking, I believe we should focus more on developer-type users of Nutch (people who build search services on Nutch). Much can be improved here: currently most of the tools in Nutch are not easy to customize (beyond plain configuration) without copy-and-paste coding. Unit testing is also painful, because the pieces under test are huge.
To tackle this, we could perhaps introduce smaller units of functionality and let users construct their specific application from these pieces (using the command or composite pattern) instead of offering one monolithic tool.
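A minimal sketch of what such composition could look like, combining the command and composite patterns. The Step interface, the Pipeline class and the named helper are hypothetical and exist only in this example; the step names are plain labels here and do not invoke any Nutch tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ComposableStepsSketch {

    /** Hypothetical command: one small, separately testable unit of work. */
    interface Step {
        void execute(List<String> log);
    }

    /** Composite: a Step built from other Steps, executed in order. */
    static final class Pipeline implements Step {
        private final List<Step> steps;

        Pipeline(Step... steps) {
            this.steps = Arrays.asList(steps);
        }

        @Override
        public void execute(List<String> log) {
            for (Step step : steps) {
                step.execute(log);
            }
        }
    }

    /** Stand-in for a real unit of work: it only records its name in the log. */
    static Step named(String name) {
        return log -> log.add(name);
    }

    public static void main(String[] args) {
        // Pipelines nest, so an application can reuse a sub-sequence as one step.
        Step fetchAndParse = new Pipeline(named("fetch"), named("parse"));
        Step cycle = new Pipeline(named("generate"), fetchAndParse, named("update"));

        List<String> log = new ArrayList<>();
        cycle.execute(log);
        System.out.println(log);
    }
}
```

Because every step implements the same small interface, each one can be unit tested in isolation, which addresses the testing pain mentioned above.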
I also believe that decoupling configuration from the "caching-within-configuration" pattern used in Nutch would clear things up significantly.
Provide build scripts and example Eclipse projects (and projects for other IDEs) for developing plugins in separate projects.
Add more support for Maven 2 users, for example by publishing the core libraries to Maven repositories, starting with the Apache snapshot repository. Build Maven archetypes (one for each extension point?) for faster bootstrapping for m2 developers.
Nutch for researchers