Differences between revisions 3 and 4
Revision 3 as of 2006-10-07 20:09:52
Size: 4015
Editor: SamiSiren
Comment:
Revision 4 as of 2009-09-20 23:10:07
Size: 4033
Editor: localhost
Comment: converted to 1.6 markup

Nutch with OSGi

"OSGi technology is the dynamic module system for Java™ OSGi technology provides a service-oriented, component-based environment for developers and offers standardized ways to manage the software lifecycle."

Why

  • I like the idea of extensions (components, OSGi bundles) being packaged in a single .jar file
  • Thinking in terms of components/services/interfaces naturally leads to a clearer design with looser coupling

Short term goals

  • Evaluate how OSGi (more specifically Apache Felix) could fit into Nutch; perhaps the easiest place to start is the plugin system and the plugins
  • Build a minimal prototype system that can complete a full crawl cycle
  • Verify that the required actions can be done non-intrusively, e.g. nutch-osgi must not introduce any changes to Nutch or Hadoop (but can of course report discoveries and a "wishlist" to these projects to make life easier for nutch-osgi)
  • Add support for using OSGi bundles as Nutch plugins (if applicable)

What has been done so far

The build system was mavenized. The required jars were grouped together and packaged as OSGi bundles; packaging was done with maven2 with the help of maven-osgi-plugin. (maven2 is not required, but IMO it makes things easier.)

Bundles and their contents

bundle | jars
nutch-hadoop | hadoop-0.9.0-SNAPSHOT, hadoop-0.5.1-SNAPSHOT
hadoop-nutch-common-deps | jetty-5.1.4, lucene-misc-1.9.1, lucene.core-1.9.1, commons-cli-2.0-SNAPSHOT, commons-logging-1.0.2, log4j-1.2.13
nutch-deps | concurrent-1.3.4, commons-lang-2.1, oro-2.0.4
protocol-http | lib-http-0.9.0-SNAPSHOT.jar, protocol-http-0.9.0-SNAPSHOT
scoring-opic | scoring-opic-0.9.0-SNAPSHOT
urlfilter-prefix | urlfilter-prefix-0.9.0-SNAPSHOT
nutch-osgi-adapter | Contains custom code required to run Nutch inside an OSGi container

Identified glue code (so far)

  • NutchOSGiConfiguration: acts as a Decorator for the Configuration object. Adjusts the classpath so that the Configuration object can find the configuration files inside the bundle (hadoop-default.xml, hadoop-site.xml, nutch-default.xml, nutch-site.xml)
  • PluginHelper: listens for bundle activations and registers plugin bundles as plugins in the Nutch plugin system (OSGi plugins cannot depend on any non-OSGi plugin)
  • OSGIPluginDescriptor: adapts a bundle to PluginDescriptor
  • OSGiExtension: adapts a bundle to Extension
  • OSGiPlugin: adapts a bundle to Plugin
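
The listener and adapter roles listed above could be sketched roughly as follows. This is only an illustration of the Adapter pattern: BundleInfo, DescriptorView, BundleDescriptorAdapter and PluginRegistry are hypothetical stand-ins invented for this example, not the real OSGi or Nutch types, so that it compiles without either library. The real glue code would presumably hook into the container's bundle events and adapt to the Nutch PluginDescriptor, Extension and Plugin classes instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PluginAdapterSketch {

    /** Hypothetical stand-in for what an OSGi container knows about a bundle. */
    static final class BundleInfo {
        final String symbolicName;
        final String version;
        final boolean active;

        BundleInfo(String symbolicName, String version, boolean active) {
            this.symbolicName = symbolicName;
            this.version = version;
            this.active = active;
        }
    }

    /** Hypothetical stand-in for the descriptor the plugin system expects. */
    interface DescriptorView {
        String getPluginId();
        String getVersion();
    }

    /** Adapter: presents a bundle through the descriptor interface. */
    static final class BundleDescriptorAdapter implements DescriptorView {
        private final BundleInfo bundle;

        BundleDescriptorAdapter(BundleInfo bundle) {
            this.bundle = bundle;
        }

        @Override
        public String getPluginId() {
            return bundle.symbolicName;
        }

        @Override
        public String getVersion() {
            return bundle.version;
        }
    }

    /** Plays the PluginHelper role: registers a plugin when its bundle is active. */
    static final class PluginRegistry {
        private final Map<String, DescriptorView> plugins = new LinkedHashMap<>();

        void bundleChanged(BundleInfo bundle) {
            if (bundle.active) {
                plugins.put(bundle.symbolicName, new BundleDescriptorAdapter(bundle));
            }
            // Events for bundles that are not active are ignored in this sketch.
        }

        List<String> pluginIds() {
            return new ArrayList<>(plugins.keySet());
        }
    }

    public static void main(String[] args) {
        PluginRegistry registry = new PluginRegistry();
        // Bundle names come from the table above; the versions are made up.
        registry.bundleChanged(new BundleInfo("protocol-http", "0.9.0", true));
        registry.bundleChanged(new BundleInfo("scoring-opic", "0.9.0", false));
        registry.bundleChanged(new BundleInfo("urlfilter-prefix", "0.9.0", true));
        System.out.println(registry.pluginIds()); // [protocol-http, urlfilter-prefix]
    }
}
```

Keeping the adapters thin like this is what would let the existing plugin system stay unchanged, in line with the non-intrusive goal stated earlier.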

Some observations and ideas from the (still continuing) trip

Hadoop Configuration should really be made an interface (HADOOP-24), and some configuration method other than files inside jars should be invented so that we can separate Nutch and Hadoop into two bundles. This would allow us to run different configurations (different versions of each package/bundle) more easily.
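
To make the wish concrete, here is a minimal sketch of what configuration behind an interface might look like. It is not the Hadoop API and not what HADOOP-24 proposes in detail: Config, MapConfig and LayeredConfig, as well as the property names, are hypothetical and chosen only for illustration. The point is that once callers depend on an interface, the values no longer have to come from XML files packed inside a jar.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigInterfaceSketch {

    /** Hypothetical minimal configuration contract (not the Hadoop API). */
    interface Config {
        /** Returns the value for the name, or null if it is not set. */
        String get(String name);

        default String get(String name, String defaultValue) {
            String value = get(name);
            return value != null ? value : defaultValue;
        }
    }

    /** In-memory implementation; the values could come from any source. */
    static final class MapConfig implements Config {
        private final Map<String, String> values = new HashMap<>();

        MapConfig set(String name, String value) {
            values.put(name, value);
            return this;
        }

        @Override
        public String get(String name) {
            return values.get(name);
        }
    }

    /** Layers site overrides on top of defaults, similar in spirit to *-default.xml and *-site.xml. */
    static final class LayeredConfig implements Config {
        private final Config defaults;
        private final Config overrides;

        LayeredConfig(Config defaults, Config overrides) {
            this.defaults = defaults;
            this.overrides = overrides;
        }

        @Override
        public String get(String name) {
            String value = overrides.get(name);
            return value != null ? value : defaults.get(name);
        }
    }

    public static void main(String[] args) {
        // Property names and values are invented for this example.
        Config defaults = new MapConfig().set("example.threads", "10").set("example.agent", "default-agent");
        Config site = new MapConfig().set("example.threads", "25");
        Config conf = new LayeredConfig(defaults, site);

        System.out.println(conf.get("example.threads"));          // 25 (site override wins)
        System.out.println(conf.get("example.agent"));            // default-agent
        System.out.println(conf.get("example.missing", "none"));  // none
    }
}
```

With such a contract, each bundle could supply its own implementation, which is one way the Nutch and Hadoop bundles might be decoupled.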

The current nutch script will probably need to be (re)implemented in Java, as there will probably be a single front door for entering (starting) OSGi Nutch.

It would be nice if Lucene and Hadoop were "natively" built as OSGi bundles; all this requires is a custom manifest inside the .jar. It would also be nice if the Lucene family automatically built and deployed packages to m2 repositories.
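
As a rough idea of what such a custom manifest contains, the sketch below builds one with the standard java.util.jar.Manifest class and prints it. The header names (Bundle-ManifestVersion, Bundle-SymbolicName, Bundle-Version, Export-Package) are standard OSGi manifest headers; the symbolic name, version and exported package list are assumptions made up for the example and do not describe what an official Lucene or Hadoop bundle would declare.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class BundleManifestSketch {

    /** Builds a manifest carrying a minimal set of OSGi bundle headers. */
    static Manifest bundleManifest(String symbolicName, String version, String exportedPackages) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version has to be set before the manifest is written out.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue("Bundle-ManifestVersion", "2");
        main.putValue("Bundle-SymbolicName", symbolicName);
        main.putValue("Bundle-Version", version);
        main.putValue("Export-Package", exportedPackages);
        return manifest;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical values, chosen only for illustration.
        Manifest manifest = bundleManifest(
                "org.example.lucene.core",
                "1.9.1",
                "org.apache.lucene.index,org.apache.lucene.search");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        manifest.write(out);
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

In a real build the headers would more likely be generated by the build tool than written by hand; the sketch only shows how little a plain jar is missing before a container can treat it as a bundle.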

NutchOSGi (last edited 2009-09-20 23:10:07 by localhost)