LuceneIndexTransformer is a component that creates or updates Lucene indexes.

This component only writes the index: to search the index, use the SearchGenerator component.

Why use it?

Instead of using the LuceneIndexTransformer, you could generate an index by crawling your website. However, indexing with the LuceneIndexTransformer is much faster than crawling.

The big differences for the developer are:

Declaring the LuceneIndexTransformer

The transformer must be declared in the <map:transformers> section of your sitemap:

<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">

   <map:components>
      ...
      <map:transformers default="xslt">
         <map:transformer name="index" 
            logger="sitemap.transformer.luceneindextransformer" 
            src="org.apache.cocoon.transformation.LuceneIndexTransformer"/>
      </map:transformers>
      ...
   </map:components>
   ...
</map:sitemap>
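
Once declared, the transformer is used in a pipeline like any other transformer. The following is only a minimal sketch: the match pattern "create-index" and the source file "content/index-source.xml" are hypothetical names, and the source file is assumed to already be in the lucene:index format described in the next section.

<map:pipelines>
   <map:pipeline>
      <!-- hypothetical pattern: requesting "create-index" runs the indexing -->
      <map:match pattern="create-index">
         <!-- hypothetical source: a document whose root element is lucene:index -->
         <map:generate src="content/index-source.xml"/>
         <!-- "index" is the name given to the transformer in the declaration above -->
         <map:transform type="index"/>
         <map:serialize type="xml"/>
      </map:match>
   </map:pipeline>
</map:pipelines>

Serializing as XML simply returns the transformer's output, so you can see what was indexed.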

Input document for the LuceneIndexTransformer

This is a sample of the kind of document that the transformer expects. Note: in this example, I've chosen a couple of simple XHTML documents as the content to be indexed, only because everyone knows XHTML. In practice you should typically generate the index from an early stage in the pipeline, indexing DocBook, TEI, etc., rather than a presentation format like HTML.

<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
   analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" 
   directory="index" 
   create="false" 
   merge-factor="20">

   <lucene:document url="http://localhost/sample.html">
      <!-- here is some sample content -->
      <html>
         <head>
            <title lucene:store="true">Sample</title>
         </head>
         <body>
            <h1>Blah</h1>
            <a href="blah.jpg" title="download blah image"
               lucene:text-attr="title">
               <img src="blah-small.jpg" alt="Blah"
                  lucene:text-attr="alt"/>
            </a>
         </body>
      </html>
   </lucene:document>

   <lucene:document url="http://localhost/sample-2.html">
      <!-- Another sample doc -->
      <html>
         <head>
            <title lucene:store="true">Second Sample</title>
         </head>
         <body>
            <h1>Foo</h1>
            <p>Lorem ipsum dolor sit amet, 
            consectetuer adipiscing elit. </p>
         </body>
      </html>
   </lucene:document>

</lucene:index>
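
A document like this does not have to be written by hand. One possible approach, sketched here under assumptions, is an XSLT stylesheet that wraps a source document in the lucene:index and lucene:document elements. The "url" stylesheet parameter is a hypothetical name, and the source is assumed to use un-namespaced element names as in the sample above.

<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:lucene="http://apache.org/cocoon/lucene/1.0">

   <!-- hypothetical parameter: the URL that a search should return for this document -->
   <xsl:param name="url"/>

   <!-- wrap the whole source document in lucene:index / lucene:document -->
   <xsl:template match="/">
      <lucene:index directory="index" create="false">
         <lucene:document url="{$url}">
            <xsl:apply-templates/>
         </lucene:document>
      </lucene:index>
   </xsl:template>

   <!-- mark the title so that its text is stored as well as indexed -->
   <xsl:template match="title">
      <xsl:copy>
         <xsl:attribute name="lucene:store">true</xsl:attribute>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- copy everything else unchanged -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>

If your source documents are in a namespace (for example real XHTML), the match pattern for the title needs a prefix bound to that namespace.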

What the lucene:index document means

The lucene:index element

The root element is lucene:index. The attributes of the lucene:index in the sample above are shown with their default values - so the effect is as if they were not specified at all.

The merge-factor and analyzer attributes

See the Lucene documentation for explanations of what these attributes mean.

The optimize-frequency attribute (since version 2.2)

Determines how often the Lucene index will be optimized. When you have thousands of documents, optimizing the index can become quite slow (e.g. 7 seconds for 9000 small docs on a P4).

For example, you can create one pipeline without optimizing, which is used to index your document every time it is modified. You can then create another pipeline which does optimize, and which is called manually. For more info see the Lucene FAQ, "What is index optimization and when should I use it?":

http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8

The directory attribute

This attribute controls where the index files are stored. The path is relative to the Cocoon work directory.

The create attribute

This attribute controls whether the index is recreated: set it to true to build a new index from scratch, or to false to update the existing index.

The lucene:document element

Lucene will index the content of each lucene:document, which may contain any XML content. The indexed content is associated with the URL specified by the url attribute, so this URL is what a search will return as its result.

The lucene:text-attr attribute

Normally Lucene will only index the text content of elements, not their attribute values. To index the attributes of an element as well, give it an attribute called lucene:text-attr, containing a list of the names of the attributes you want indexed. For example, to index the value of the alt attribute of an img element in HTML:

<img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/>

This would index the text "Blah".

The lucene:store attribute

Normally Lucene will only index the text of an element, not store it. To store the text of an element in Lucene's index, add a lucene:store="true" attribute to the element. It's a good idea to store the title of a document in Lucene, so that your search results can show a document title as well as a URL.

The transformation

The transformer copies the source document to the output, except for the content of the lucene:document elements.

The transformer also adds an elapsed-time attribute to the output lucene:document elements, showing the time (in milliseconds) taken to index that document. You can use XSLT to transform the results into a report on the indexing operation.

Sample output

<?xml version="1.0" encoding="UTF-8"?>
<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
        merge-factor="20" 
        create="false" 
        directory="index" 
        analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
        <lucene:document url="JCB-001/full.html" elapsed-time="3846"/>
        <lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/>
        <lucene:document url="JCB-002/full.html" elapsed-time="361"/>
        <lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/>
        <lucene:document url="JCB-003/full.html" elapsed-time="300"/>
        <lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/>
</lucene:index>
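
As an illustration of the report idea, here is a minimal XSLT sketch that turns output like the above into an HTML table, slowest documents first. It relies only on the url and elapsed-time attributes shown in the sample; the page layout is an arbitrary choice.

<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
   exclude-result-prefixes="lucene">

   <xsl:template match="/lucene:index">
      <html>
         <head>
            <title>Indexing report</title>
         </head>
         <body>
            <!-- summary: number of documents and total time spent indexing -->
            <p>
               Indexed <xsl:value-of select="count(lucene:document)"/> documents
               in <xsl:value-of select="sum(lucene:document/@elapsed-time)"/> ms.
            </p>
            <table>
               <tr>
                  <th>URL</th>
                  <th>Time (ms)</th>
               </tr>
               <!-- one row per indexed document, slowest first -->
               <xsl:for-each select="lucene:document">
                  <xsl:sort select="@elapsed-time" data-type="number" order="descending"/>
                  <tr>
                     <td><xsl:value-of select="@url"/></td>
                     <td><xsl:value-of select="@elapsed-time"/></td>
                  </tr>
               </xsl:for-each>
            </table>
         </body>
      </html>
   </xsl:template>

</xsl:stylesheet>

In a pipeline, a stylesheet like this would sit after the LuceneIndexTransformer and before an HTML serializer.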

Note to users of Mac OS X

By default, Java cannot open more than 256 files at a time, so you may get an error like the following:

Description: org.apache.cocoon.ProcessingException: 
Failed to execute pipeline.: java.lang.RuntimeException: 
java.io.FileNotFoundException:  
/usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 
(Too many open files)

To avoid this error, you should set your ulimit in the shell script that starts Tomcat. My line reads as follows:

ulimit -S -n 1000

Read more about this here: http://www.amug.org/~glguerin/howto/More-open-files.html

Note to users of Red Hat Linux

If you get an (Empty StackException) error while creating the index with the LuceneIndexTransformer, try lowering your merge-factor (Lucene's own default is 10). See the Lucene documentation for more information.

LuceneIndexTransformer (last edited 2009-09-20 23:42:51 by localhost)