This page discusses uncommitted code and design. See LUCENE-2899 for the main JIRA issue tracking this development. The issue is packaged as a Solr contrib, but is split between Lucene and Solr.
NLP is a large field of inquiry. Unless you are familiar with it you may find this patch confusing. The Apache OpenNLP project is the best place to learn what this package can do.
OpenNLP is a toolkit for Natural Language Processing (NLP) and an Apache top-level project. It includes implementations of many popular NLP algorithms. This project integrates some of its features into Lucene and Solr. This first effort incorporates Analyzer chain tools for sentence detection, tokenization, Parts-of-Speech tagging (nouns, verbs, interjections, etc.), chunking (noun phrases, verb phrases) and Named Entity Recognition. See the OpenNLP project page for information on the implementations. Here are some use cases:
Indexing interesting words
NLP lets you create a field with only the nouns in a document. This would be useful for many free text applications. The FilterPayloadsFilter and StripPayloadsFilter below are required for this. See "Full Example" below.
Chunking lets you create N-Grams only within noun and verb phrases.
Named Entity Recognition
Named Entity Recognition identifies names, dates, places, currency amounts and other types of data within free text. This is profoundly useful in searching. You can also create facets or autosuggest entries with icons for 'Name', 'Place', etc.
The OpenNLP Tokenizer behavior is similar to the WhiteSpaceTokenizer but is smart about inter-word punctuation. The term stream looks very much like the way you parse words and punctuation while reading. The OpenNLP taggers assign payloads to terms. There are tools to filter the term stream according to the payload values, and to remove the payloads.
Tokenizes text into sentences or words.
This Tokenizer uses the OpenNLP Sentence Detector and/or Tokenizer classes. When used together, the Tokenizer receives sentences and can do a better job. The arguments give the file names of the statistical models:
<fieldType name="text_opennlp" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
      sentenceModel="opennlp/en-sent.bin"
      tokenizerModel="opennlp/en-token.bin"
    />
  </analyzer>
</fieldType>
Tags words using one or more technologies: Parts-of-Speech, Chunking, and Named Entity Recognition.
<fieldType name="text_opennlp_pos" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
      tokenizerModel="opennlp/en-token.bin"
    />
    <filter class="solr.OpenNLPFilterFactory"
      posTaggerModel="opennlp/en-pos-maxent.bin"
    />
  </analyzer>
</fieldType>
This example assigns part-of-speech tags based on a model trained with the OpenNLP Maximum Entropy implementation. See OpenNLP Tagging for more information. The tags are from the Penn Treebank tagset.
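The same filter factory is intended to cover chunking and Named Entity Recognition as well. As a sketch only: the chunkerModel and nerTaggerModels attribute names and the en-chunker.bin / en-ner-person.bin model names below are assumptions and have not been verified against the current LUCENE-2899 patch.

```xml
<!-- Sketch: attribute and model names here are assumptions, not verified
     against the current LUCENE-2899 patch. -->
<filter class="solr.OpenNLPFilterFactory"
  posTaggerModel="opennlp/en-pos-maxent.bin"
  chunkerModel="opennlp/en-chunker.bin"
  nerTaggerModels="opennlp/en-ner-person.bin"
/>
```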
Filter terms for certain payload values. In this example, retain only terms which have been marked 'nouns' and 'verbs' with the Penn Treebank tagset.
<filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true" payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
Remove payloads from terms.
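As in the full example below, this filter takes no arguments:

```xml
<filter class="solr.StripPayloadsFilterFactory"/>
```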
This "Noun-Verb Filter" field type assigns parts of speech, retains only nouns and verbs, and removes the payloads. Free-text search sites (for example, newspaper and magazine articles) may benefit from this.
<fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
      tokenizerModel="opennlp/en-token.bin"
    />
    <filter class="solr.OpenNLPFilterFactory"
      posTaggerModel="opennlp/en-pos-maxent.bin"
    />
    <filter class="solr.FilterPayloadsFilterFactory"
      payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"
    />
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
This example should work well with most English-language free text.
For English-language testing, until LUCENE-2899 is committed:
- pull the latest trunk or 4.0 branch
- apply the latest LUCENE-2899 patch
- do 'ant compile'
- cd solr/contrib/opennlp/src/test-files/training
- run 'bin/trainall.sh'
- this will create binary files which will be included in the distribution when committed.
Now, go to trunk-dir/solr and run 'ant test-contrib'. This compiles the OpenNLP Lucene and Solr code against the OpenNLP libraries and runs the tests using the small model files.
Deployment to Solr
A Solr core requires schema types for the OpenNLP Tokenizer & Filter, and also requires "real" model files. The distribution includes a schema.xml file in solr/contrib/opennlp/src/test-files/opennlp/solr/conf/ which demonstrates OpenNLP-based analyzers. It does not contain other text types (to avoid falling out of date with the full text suite). You should copy the text types from this file into your test collection schema.xml, and download "real" models for testing. Also, you may have to add the OpenNLP lib directory to your solr/lib or solr/cores/collection/lib directory. The text types assume that cores/collection/conf/opennlp contains the OpenNLP model files.
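One way to make the OpenNLP jars visible to the core is a <lib/> directive in solrconfig.xml. The relative paths below are illustrative assumptions that depend on where your core sits relative to the contrib and dist directories:

```xml
<!-- solrconfig.xml: load the OpenNLP libraries and the contrib jar.
     The dir paths are illustrative; adjust them to your install layout. -->
<lib dir="../../contrib/opennlp/lib" regex=".*\.jar" />
<lib dir="../../dist/" regex="apache-solr-opennlp-.*\.jar" />
```

Alternatively, copying the jars into solr/lib or solr/cores/collection/lib, as described above, avoids editing solrconfig.xml.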
The OpenNLP SourceForge download server has "real" models for the OpenNLP project. Download model files to your solr/cores/collection/conf/opennlp directory.
- The English-language models start with 'en'. The 'maxent' models are preferred to the 'perceptron' models.
Your Solr should start without any Exceptions. At this point, go to the Schema Analysis page, pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should get text tokenized with payloads. Unfortunately, the analysis page shows them as bytes instead of text. If you would like to see them in text form, go vote on SOLR-3493 (or implement it).
The OpenNLP library is Apache-licensed. The 'jwnl' library is 'BSD-like'.
The contrib directory includes some small training data and scripts to generate model files. These are supplied only for running "unit" tests against the complete Solr/Lucene/OpenNLP code assemblies. They are not useful for exploring OpenNLP's features or for production deployment.
In solr/contrib/opennlp/src/test-files/training, run 'bin/trainall.sh' to populate solr/contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp with the test models. The schema.xml in that conf/ directory uses those models for running the unit tests.
The models available from SourceForge are created from licensed training data. I have not seen a formal description of their license status, but they are not "safe" for Apache. If you want production-quality models for commercial use, you will need to make other arrangements. The upshot is that you have to download statistical models from SourceForge to make OpenNLP work; the models do not have an Apache-compatible license.