This page outlines our medium- to long-term vision for Nutch. It is important for both users and developers to understand what we would like to achieve with Nutch over the next few months to a year, and beyond.
Nutch is a community-driven effort, so you are welcome to contribute your views to this page. When adding to the sections below, consider the potential impact on the target audience as well as the development effort needed to reach each goal, and set the goals realistically ...
- Scalability and performance
- Plugins and modularity
  The plugin API itself is perhaps not optimal: for each interface (plugin) there is a factory, and each factory contains clutter code that does similar things internally.
  - Refactor the clutter code out of the factories into a utility class, or improve the plugin system API and move the clutter into the plugin system itself.
  - Allow plugins to be packaged as independent .jar files, to ease distribution and to make plugins easier to develop outside of Nutch.
  - Evaluate other strategies for extending Nutch (OSGi, a DI framework, the standard Java extension mechanism, ...)
- Configuration, management and control
- Automation scripts for job streams
- Page and link databases
- Scoring and ranking
- Supported content
- Incremental or checkpointed fetching: the ability to keep previously fetched results even if the fetching job fails partway through.
- Segment management
- Distributed search - architecture and deployment
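The factory refactoring suggested under "Plugins and modularity" could be sketched roughly as follows. This is only an illustration under stated assumptions: `ExtensionCache`, `TextCleaner` and `TextCleaners` are hypothetical names that do not exist in Nutch, and the sketch assumes Java 8 or later. The idea is that the "create once, then reuse" logic lives in one shared helper, so each factory shrinks to a thin typed wrapper.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical shared helper (not part of the Nutch API): holds the
// "create once, then reuse" logic that each factory would otherwise repeat.
final class ExtensionCache {
    private final Map<String, Object> instances = new ConcurrentHashMap<>();

    // Returns the cached instance for (type, id), creating it on first use.
    <T> T get(Class<T> type, String id, Supplier<? extends T> creator) {
        String key = type.getName() + "#" + id;
        return type.cast(instances.computeIfAbsent(key, k -> creator.get()));
    }
}

// Hypothetical extension point, used only for illustration.
interface TextCleaner {
    String clean(String text);
}

// A factory reduced to a thin, typed wrapper around the shared helper.
final class TextCleaners {
    private final ExtensionCache cache;

    TextCleaners(ExtensionCache cache) {
        this.cache = cache;
    }

    TextCleaner byId(String id, Supplier<? extends TextCleaner> creator) {
        return cache.get(TextCleaner.class, id, creator);
    }
}
```

With this shape, a factory for another extension point would only need its own small wrapper class. Whether the helper belongs in a utility class or inside the plugin system itself is the open question raised in the list above.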
Target audience-related goals
- Nutch for small-scale users
- Nutch for large-scale users
- Nutch for system integrators
- Nutch for developers
  Generally speaking, I believe we should focus more on developer-type users of Nutch (people who build search services on Nutch). A great deal could be improved here. Currently most of the tools in Nutch are not easy to customize (configuration aside) without copy-and-paste coding. Unit testing is also painful because the pieces under test are huge. To tackle this, we could introduce smaller units of functionality and let users construct their specific application from these pieces (using the command or composite pattern) instead of offering one monolithic tool. I also believe that decoupling configuration from the "caching-within-configuration" pattern used in Nutch would clear things up significantly.
  - Build scripts and example Eclipse projects (and projects for other IDEs) for developing plugins in separate projects.
  - Add more support for Maven 2 users, for example by publishing core libraries to Maven repositories, starting with the Apache snapshot repository. Build Maven archetypes (one for each extension point?) for faster bootstrapping for m2 developers.
- Nutch for researchers
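The idea under "Nutch for developers" of building applications from smaller units via the command or composite pattern could look roughly like the sketch below. It is a minimal illustration, not existing Nutch code: `Step` and `CompositeStep` are hypothetical names, the sketch assumes Java 8 or later, and the steps only record their names instead of calling any real Nutch tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical command: one small, separately testable unit of work.
interface Step {
    void run(List<String> log);
}

// Composite: a sequence of steps that is itself a step, so sequences nest.
final class CompositeStep implements Step {
    private final List<Step> steps;

    CompositeStep(Step... steps) {
        // Copy the array so later changes by the caller do not affect us.
        this.steps = new ArrayList<>(Arrays.asList(steps));
    }

    @Override
    public void run(List<String> log) {
        for (Step step : steps) {
            step.run(log);
        }
    }
}
```

A user could then assemble a custom job stream, for example `new CompositeStep(inject, new CompositeStep(generate, fetch))`, where `inject`, `generate` and `fetch` are placeholder `Step` instances. Each small step can be unit tested on its own, which addresses the testing concern raised above.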