This component only writes the index: to search the index, use the SearchGenerator component.
Instead of using LuceneIndexTransformer, you could generate an index by crawling your website. However, the LuceneIndexTransformer is much, much faster than crawling.
The big differences for the developer are:
lucene:index
document describing your searchable URI space, so it's necessary to have a well-defined URI space. For a site with a consistent structure this should not be too hard. This pipeline can use aggregation and inclusion mechanisms to produce a full list of the pages you want to search. In this way it's also possible to generate an index for websites with forms which are not crawlable.
content
view and a links
view for each of the searchable pipelines. The URI space is then defined by crawling the links
view.
The transformer must be declared in the <transformers>
section of your sitemap:
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0"> <map:components> ... <map:transformers default="xslt"> <map:transformer name="index" logger="sitemap.transformer.luceneindextransformer" src="org.apache.cocoon.transformation.LuceneIndexTransformer"/> </map:transformers> ... </map:components> ... </map:sitemap> |
This is a sample of the kind of document that the transformer expects. NB In this example, I've chosen a couple of simple XHTML documents as the content to be indexed. This is only because everyone knows XHTML - in practice you should typically generate the index from an early stage in the pipeline; indexing DocBook, TEI, etc, rather than a presentation format like HTML.
<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" directory="index" create="false" merge-factor="20"> <lucene:document url="http://localhost/sample.html"> <!-- here is some sample content --> <html> <head> <title lucene:store="true">Sample</title> </head> <body> <h1>Blah</h1> <a href="blah.jpg" title="download blah image" lucene:text-attr="title"> <img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/> </a> </body> </html> </lucene:document> <lucene:document url="http://localhost/sample-2.html"> <!-- Another sample doc --> <html> <head> <title lucene:store="true">Second Sample</title> </head> <body> <h1>Foo</h1> <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. </p> </body> </html> </lucene:document> </lucene:index> |
The root element is lucene:index
. The attributes of the lucene:index
in the sample above are shown with their default values - so the effect is as if they were not specified at all.
See the Lucene documentation for explanations of what they mean.
Determines how often the lucene index will be optimized. When you have 1000's of documents, optimizing the index can become quite slow (eg. 7 seconds for 9000 small docs, P4).
You can eg. create a pipe without optimizing, which is used to index you're document everytime when it's modified. You can then create another pipe which will optimize, which is called manually. For more info see the Lucene FAQ , What is index optimization and when should I use it? :
http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8
This attribute controls where the index files are stored. The path is relative to the Cocoon work
directory.
This attribute controls whether the index is recreated.
create = "false"
.
create = "true"
then any existing index will be destroyed and a new index created. If you are rebuilding your entire index then you should use create = "true"
because the indexer doesn't need to remove old documents from the index, so it will be faster.
Lucene will index the content of each lucene:document
, which may contain any xml content. The index is associated with the url specified by the url
attribute. So this url will be returned as the results of a search.
Normally Lucene will only index the content of these elements, not attribute values. To index the attributes of an element as well, give it an attribute called lucene:text-attr
, containing a list of the names of the attributes you want indexed. For example, to index the value of the alt
attribute of an img
element, in html
:
<img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/> |
This would index the text "Blah".
Normally Lucene will only index the text of an element, not store it. To store the text of an element in Lucene's index, add a lucene:store="true"
attribute to the element. It's a good idea to store the title of a document in Lucene, so that your search results can show a document title as well as a URL.
The transformer copies the source document to the output, except for the content of the lucene:document
elements.
The transformer also adds an elapsed-time
attribute to the output lucene:document
elements, showing the time (in milliseconds) taken to index that document. You can use XSLT to transform the results into a report on the indexing operation.
<?xml version="1.0" encoding="UTF-8"?> <lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" merge-factor="20" create="false" directory="index" analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"> <lucene:document url="JCB-001/full.html" elapsed-time="3846"/> <lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/> <lucene:document url="JCB-002/full.html" elapsed-time="361"/> <lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/> <lucene:document url="JCB-003/full.html" elapsed-time="300"/> <lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/> </lucene:index> |
Java can not open more than 256 files at a time by default, so you may get an error like the following:
Description: org.apache.cocoon.ProcessingException: Failed to execute pipeline.: java.lang.RuntimeException: java.io.FileNotFoundException: /usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 (Too many open files) |
To avoid this error, you should set your ulimit in the shell script that starts Tomcat. My line reads as follows:
ulimit -S -n 1000 |
Read more about this here: http://www.amug.org/~glguerin/howto/More-open-files.html
If you get the following error: (Empty StackException) while creating the index with the LuceneIndexTransformer try to alter your merge-factor to a lower value (default should be 10). Look at the Lucene documentation for more information.