Google Summer of Code proposal
Fourth version (accepted)
13 June 2005
Apache's lenya-search project
Current maintainers and potential mentor(s)
Robert Goene, University of Amsterdam, The Netherlands
I am a Philosophy major at the University of Amsterdam and have several years of professional programming experience, mostly with Microsoft-related technologies.
I have been working with Lenya for a couple of months now and quickly became a Lenya, Cocoon and Java enthusiast.
Areas of interest include the practical uses and theoretical foundations of information systems in the broadest sense. An XML-based CMS fits in perfectly.
- The Lenya-Search project is part of the Lenya Content Management System, hosted by the Apache Software Foundation. Heavily based on the XML publishing framework Cocoon, Lenya combines an easy interface for the end user with advanced possibilities for the XML-aware developer. This makes Lenya a good choice for both straightforward and more complex websites. The search facilities of Lenya are based on the Apache project Lucene, a search engine that takes care of indexing documents and processing queries. The objective of the lenya-search project is the integration of Lenya and Lucene. The current integration is not as easy and flexible as it should be for a complete CMS. The indexing process, for instance, depends on a number of home-made indexers that add all documents to Lucene, and it must be started manually through an Ant job. The indexers are not flexible enough and should focus more on the documents that Lenya deals with: XHTML documents with Dublin Core metadata. Besides this, it should be easy to add custom document types to the CMS: Lenya should be able to handle XML documents of all kinds in a more straightforward way. This proposal is part of that more general goal.
- In other words: the search facilities should be integrated further into Lenya. The search possibilities are not trivial to use in a Lenya publication, and they should be. The development will be based on the current trunk of the project: version 1.4. This major release contains a large number of architectural changes, so a change like the one described here is appropriate for it. The current stable version (1.2) will only be updated with crucial bug fixes; no significant new features will be added.
- The project will consist of a number of subprojects that can be developed fairly independently of each other. This section gives a functional description of each subproject and an overview of the techniques used.
Integrate the indexing process with the Lenya publishing usecases
Index the document when submitted or published
- When a document is submitted or published it should be added to the Lucene index immediately. This can be accomplished by extending the publish process, which is implemented as a Lenya 1.4 usecase.
Remove the document from the index when deactivated
- Documents that are no longer part of the 'Live' section of the Lenya publication (the publicly available website) should be removed from the Lucene index immediately. Just as the publish usecase is extended with indexing, the deactivate usecase of Lenya 1.4 should be extended to remove the document from the Lucene index.
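As a minimal sketch of the two hooks described above, the following Python fragment shows how publishing could add a document to an index and deactivating could remove it again. Python is used only for brevity. The class InMemoryIndex and the functions on_publish and on_deactivate are hypothetical stand-ins for illustration; they are not part of the Lenya usecase framework or the Lucene API.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy stand-in for a Lucene index: maps each term to a set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.terms_by_doc = {}            # doc id -> its terms, kept for removal

    def add(self, doc_id, text):
        # Drop any older entry first, so publishing the same document twice is safe.
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.terms_by_doc[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        for term in self.terms_by_doc.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def on_publish(index, doc_id, text):
    """Hypothetical hook called at the end of the publish usecase."""
    index.add(doc_id, text)


def on_deactivate(index, doc_id):
    """Hypothetical hook called at the end of the deactivate usecase."""
    index.remove(doc_id)
```

The point of the sketch is the symmetry: the index changes in the same step as the publication state, so no separate manual indexing run is needed.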
- Lucene comes packaged with a standard XML and HTML parser to add documents to the index. This parser extracts the data from the document and stores it in different fields of the Lucene index. The documents that Lenya works with are extended XHTML documents. They can be parsed with the standard HTML parser, but the metadata that comes with these Lenya documents would then not be indexed.
- As a replacement for the Configurable Indexer, which creates indexes from a document based on a collection of XPath statements, I would like to propose an alternative way of configuring the indexed data. This replacement would consist of tags in the internal XML documents of Lenya. Every XML element that must be added to the index needs a special attribute, something like indexField="fieldName". One big advantage of this approach is the availability of data that is not visible to the outside world but could help the search mechanism determine the most relevant results, such as metadata that is not completely rendered to HTML, like the date of creation or the creator. Besides this, adding a new document type to Lenya becomes easier when the indexing of the document can be specified in the sample document and the Relax NG schema.
- Every document in Lenya has an accompanying Relax NG schema that validates every edited document when it is saved. This schema should allow a document to have the index marker assigned to a number of elements: every element that is allowed to be added to the Lucene index should be extended with the lenya:index element. This may sound like a lot of work, but it should not be that hard. An XHTML document, for instance, only needs a few metadata elements and the body element to be specified. The following snippet shows the use of the lenya:index element.
<lenya:index fieldname="title" includechilds="false" indexattribute="attributename" />
- The mandatory 'fieldname' is the Lucene index field to populate. The 'includechilds' boolean controls whether all children of the current node are included; the non-default value "false" results in the processing of the current node only. The optional 'indexattribute' names the attribute to index; when no value is supplied, the text of the node is placed in the given field. Note that more than one lenya:index element can be added to the same element. This makes it possible to add the same data to different fields, which is useful when users want either a general or a more specific search: the data is added to a general field that is shared with other elements, and the specific field can be queried by anyone who knows exactly which field to address.
- The XML document itself must add the lenya:index elements to the elements that must be indexed. The actual field name is specified in the XML document and not in the schema. This makes filling the index more flexible, without making it harder to have a common index field for all documents. Since all documents are created from a sample XML file, the default index fields can be provided in this file, while individual exceptions remain possible. The Lenya index parser described above must be applied to the most used document type in Lenya: the XHTML document extended with Dublin Core metadata.
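The rules for 'fieldname', 'includechilds' and 'indexattribute' can be sketched as a small field extractor. This is an illustration under stated assumptions, not the proposed Java implementation: the namespace URI is a placeholder, lenya:index is assumed to be a direct child of the element it marks, and 'includechilds' is assumed to default to "true" because the text calls "false" the non-default value. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

LENYA_NS = "http://example.org/lenya-index"  # placeholder namespace URI (assumption)
INDEX_TAG = f"{{{LENYA_NS}}}index"


def own_text(element):
    """Text directly inside the element, ignoring the text of child elements."""
    parts = [element.text or ""] + [child.tail or "" for child in element]
    return " ".join("".join(parts).split())


def full_text(element):
    """Text of the element and all of its descendants, whitespace-normalised."""
    return " ".join("".join(element.itertext()).split())


def extract_fields(xml_source):
    """Return {field name: [values]} for every element marked with lenya:index."""
    fields = {}
    for element in ET.fromstring(xml_source).iter():
        for marker in element.findall(INDEX_TAG):  # direct children only
            name = marker.get("fieldname")
            if not name:
                raise ValueError("lenya:index requires a fieldname attribute")
            attribute = marker.get("indexattribute")
            if attribute:
                value = element.get(attribute, "")   # index an attribute value
            elif marker.get("includechilds", "true") == "false":
                value = own_text(element)            # current node only
            else:
                value = full_text(element)           # node plus all children
            fields.setdefault(name, []).append(value)
    return fields


SAMPLE = """
<html xmlns:lenya="http://example.org/lenya-index">
  <head>
    <title><lenya:index fieldname="title"/><lenya:index fieldname="contents"/>Search in Lenya</title>
    <meta name="creator" content="Robert Goene"><lenya:index fieldname="creator" indexattribute="content"/></meta>
  </head>
  <body><lenya:index fieldname="contents"/><p>Lucene <em>indexes</em> documents.</p></body>
</html>
"""

print(extract_fields(SAMPLE))
```

In the sample, the title carries two markers, so its text lands in both the specific "title" field and the shared "contents" field, which is the general-versus-specific search case described above.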
- By adding an extra field called 'Document Boost' to the metadata of the documents, it will be possible to use the boosting feature of Lucene to control the relevance of specific documents in the search results. A pull-down menu with a selectable number for the boost level should be sufficient.
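To illustrate the effect of such a boost value, the sketch below ranks documents by a deliberately simplified score: the number of times the term occurs, multiplied by the document boost. Lucene's real scoring formula is more elaborate; the function rank and the dictionary layout are hypothetical and only show how a per-document multiplier can reorder results.

```python
def rank(documents, term):
    """Order matching documents by (term frequency x boost), highest score first."""
    scored = []
    for doc in documents:
        frequency = doc["text"].lower().split().count(term.lower())
        if frequency:
            # A missing boost counts as neutral (1.0).
            scored.append((frequency * doc.get("boost", 1.0), doc["id"]))
    # sorted() is stable, so equal scores keep their original order.
    return [doc_id for _, doc_id in sorted(scored, key=lambda pair: -pair[0])]


pages = [
    {"id": "news", "text": "lenya lenya search"},
    {"id": "home", "text": "lenya search", "boost": 3.0},
]
print(rank(pages, "lenya"))
```

Without the boost, "news" would rank first because it mentions the term twice; with a boost of 3.0 the "home" page moves to the top.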
Extract external links
- The publish process should also extract all external links (HTML and PDF) from the document and add them to the Nutch crawler, so they can be fetched and indexed in the next Nutch run. Similarly, the external links should be removed from the Nutch fetch list and the Lucene index when a document is deactivated.
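The link extraction step could look roughly like the sketch below. It assumes that "external" means an http or https link whose host differs from the host of the page itself. LinkCollector and external_links are hypothetical names, and handing the resulting list to Nutch is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, page_url):
    """Return absolute http(s) links that point to a host other than page_url's."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links first
        parsed = urlparse(absolute)
        is_web_link = parsed.scheme in ("http", "https")
        if is_web_link and parsed.netloc != own_host and absolute not in links:
            links.append(absolute)
    return links


page = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://lucene.apache.org/">Lucene</a> '
    '<a href="http://example.org/paper.pdf">Paper</a> '
    '<a href="mailto:someone@example.com">Mail</a></p>'
)
print(external_links(page, "http://www.example.com/index.html"))
```

When a document is deactivated, the same function can produce the list of links to drop, although a link that is still referenced by another live document should presumably stay in the fetch list.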
Nutch integration for external crawling
- It should be possible to add external pages to the Lucene index, for instance pages that are part of the website but are not controlled by Lenya, or external pages that contain related content. Crawling these sites will not be a problem: linking to an external page from one of the pages controlled by Lenya should be enough to have it crawled and added to the Lucene index.
Schedule the Nutch indexing task
- The external pages whose links have been extracted during the indexing of a document are fetched and indexed by Nutch. These documents can be HTML or PDF, as Nutch is able to handle both types. Nutch crawls the list of links and adds the results to the Lucene index. This will be a scheduled job that runs periodically and can be controlled from the Lenya Administrator interface.
Create Usecase for searching the current publication
- The current search pipeline is not part of a specific publication, but of the general Lenya configuration. By making it a usecase, it will be more convenient to address the search facility from an HTML form, and it will be easier to adapt the search to the needs of a specific publication. Another reason to move to usecases is that Lenya 1.4 uses them as standard. Solprovider has already implemented a feature like this. In my opinion it looks pretty good, but it can be revised and simplified with the changes proposed in this document, especially the replacement of the generator.
Change the communication of Lenya with Lucene
- The communication of Lenya with the Lucene index is rather awkward at the moment. The current approach uses a custom XSP page containing server-processed Java code that communicates with the Lucene API. This code is neither flexible nor easy to extend: making small changes to the result set can take a very long time. Different approaches to change this are possible: using the Cocoon Lucene Query Bean, which makes all Lucene search features available to any Cocoon application, or using a custom navigation component together with the standard Cocoon search generator. The latter approach seems the most appropriate to me, because of the highly customizable nature of Lenya, which only requires knowledge of XSLT. The Lucene Query Bean offers possibilities for both common and advanced uses, but seems to lack the customization that only a navigation component based on an XSLT sheet can offer.
Replace custom Lucene search generator with Cocoon Search generator
- There is a very clean and easy alternative to this XSP page and the XSLT sheets that process its results: the Cocoon search generator. By using this generator instead of the clumsy search pipeline currently employed, it will be easier to debug or change the result set for a specific publication. Besides this, it seems good practice to me to take advantage of Cocoon's facilities as much as possible.
Simplify the current search navigation component
- Make the current search form more usable, more visually attractive and easier to integrate into a publication. Change the current navigation component, search.xsl, to be compatible with the new interface and change its appearance.
Related navigation component
- Besides the results of an explicit user query, it could be interesting to add a navigation component that searches the Lucene index for related pages. This could be done on the subject or description fields of the document. The results can be integrated into the document as a flexible way of navigating through the publication.
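One simple way to approximate "related pages" is to count the subject terms that two documents share. The sketch below does this on plain dictionaries. An actual component would presumably build a Lucene query from the subject or description field of the current document instead, and the name related_pages and the data layout are hypothetical.

```python
def related_pages(current_id, documents, limit=3):
    """Return ids of documents sharing the most subject terms with the current one."""
    current_terms = set(documents[current_id]["subject"].lower().split())
    scored = []
    for doc_id, doc in documents.items():
        if doc_id == current_id:
            continue  # a page is not related to itself
        shared = current_terms & set(doc["subject"].lower().split())
        if shared:
            scored.append((len(shared), doc_id))
    # Most shared terms first; ties are broken alphabetically by id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


site = {
    "/search.html": {"subject": "lucene search index"},
    "/nutch.html": {"subject": "nutch crawler index"},
    "/lucene.html": {"subject": "lucene index"},
    "/contact.html": {"subject": "address phone"},
}
print(related_pages("/search.html", site))
```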
- 14 June 05: Proposal deadline
- 24 June 05: Acceptance or rejection of proposal
- 06 July 05: Index when publishing
- 06 July 05: Remove when deactivated
- 14 July 05: Document parser
- 21 July 05: Nutch integration
- 28 July 05: Search usecase
- 28 July 05: Search Generator
- 28 July 05: Search navigation component
- 28 July 05: Related navigation component
- 01 Sept 05: Pencils down
These considerations are not formal requirements of this proposal, but sidetracks that could play a role in future developments. Writing them down makes them part of the considerations for the current proposal without making them a direct goal of the project as described above.
Add Lucene indexviewer
- To get an overview of the created index, it should be fairly simple to integrate the index viewer Limo (http://limo.sourceforge.net/) into the administration mode of the Lenya interface. The viewer is an easy tool for digging into the created index when the search results differ from what you expected. This tool is indispensable when working with the Configurable Indexer, as it gives an overview of the created Lucene fields and their content. The tool is written as an Apache-licensed Java servlet, and the only information it needs to function is the path to the Lucene index. The integration should therefore be fairly easy.
Jackrabbit and Lucene
- The role of Jackrabbit seems to lie in more structured queries, such as those XQuery makes possible. Unstructured full-text searching, which non-technical users will use most of the time, is the domain of the Lucene engine. When the Lenya API is changed to make use of all the features that Jackrabbit promises, the document parser proposed above will have to be moved to the Lucene interface of Jackrabbit. Jackrabbit will then be responsible for a job that, for the time being, is executed by Lenya. At this point in time, the Jackrabbit integration is only a future consideration, but it should be taken into account when developing new features. The document parser will be developed with the Jackrabbit API in mind.