Reindexing in Solr

Terminology

If you use Solr for any length of time, someone will eventually tell you that you have to reindex after making a change. It comes up over and over ... but what does that actually mean?

Most changes to the schema will require a reindex, unless you only change query-time behavior. A very small subset of changes to solrconfig.xml also require a reindex, and for some changes, a reindex is recommended even when it's not required.

(warning) The term "reindex" is not a special thing you can do with Solr. It literally means "index again." You just have to restart Solr (or reload your core), possibly delete the existing index, and then repeat whatever actions you took to build your index in the first place.
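
As an illustration of the "reload your core" step, here is a minimal Python sketch that calls the CoreAdmin RELOAD action over HTTP. The base URL, the core name, and the helper names are assumptions made for this example, and it assumes a standalone (non-SolrCloud) install. It only reloads the core; it does not index anything.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def core_reload_url(base_url: str, core: str) -> str:
    """Build the CoreAdmin RELOAD URL for one core (hypothetical helper)."""
    params = urlencode({"action": "RELOAD", "core": core})
    return f"{base_url.rstrip('/')}/admin/cores?{params}"


def reload_core(base_url: str, core: str, timeout: float = 30.0) -> int:
    """Ask Solr to reload the core and return the HTTP status code."""
    with urlopen(core_reload_url(base_url, core), timeout=timeout) as response:
        return response.status
```

Calling reload_core("http://localhost:8983/solr", "mycore") would pick up schema changes for a core named mycore, after which you still have to run your own indexing process again.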

(warning) Indexing (and reindexing) is not something that just happens. Solr has no ability to initiate indexing itself. There is the dataimport handler, but it will not do anything until it is called by something external to Solr.

Indexing is something that can be manually done by a person or automatically done by a program, but it is always external to Solr. There is an issue in the bugtracker for adding dataimport handler scheduling to Solr, but it is meeting with committer resistance, because *ALL* modern operating systems have a scheduling capability built in. Also, that would mean that Solr can change your index without external action, which is generally considered a bad idea by committers.
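
To make "external to Solr" concrete, here is a small Python sketch of a script that an operating system scheduler could run to start a dataimport handler full-import. It assumes the handler is registered at /dataimport in solrconfig.xml, which is a common convention but not guaranteed; the base URL, core name, and function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url: str, core: str,
                   command: str = "full-import", clean: bool = True) -> str:
    """Build the URL that starts a dataimport handler command.

    Assumes the handler path is /dataimport (set in solrconfig.xml).
    """
    params = urlencode({
        "command": command,
        # clean=true asks the handler to delete existing documents first
        "clean": "true" if clean else "false",
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/{core}/dataimport?{params}"


def start_full_import(base_url: str, core: str, timeout: float = 30.0) -> dict:
    """Send the request and return Solr's parsed JSON response."""
    with urlopen(dataimport_url(base_url, core), timeout=timeout) as response:
        return json.load(response)
```

A script like this can be run from cron or the Windows Task Scheduler. Solr itself never starts the import.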

Depending on your setup and goals, you may need to delete all documents before you begin your indexing process. Sometimes it is necessary to delete your index directory entirely before you restart Solr or reload your core.
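
Deleting all documents can be done with a delete-by-query request to the update handler. The sketch below builds such a request in Python; the base URL, core name, and helper names are assumptions for the example. Note that this removes documents but does not delete the index directory itself.

```python
import json
from urllib.request import Request, urlopen


def build_delete_all_request(base_url: str, core: str) -> Request:
    """Build a POST that deletes every document in the core and commits."""
    body = json.dumps({"delete": {"query": "*:*"}}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_all_documents(base_url: str, core: str, timeout: float = 60.0) -> int:
    """Send the delete request and return the HTTP status code."""
    request = build_delete_all_request(base_url, core)
    with urlopen(request, timeout=timeout) as response:
        return response.status
```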

It's reasonable to wonder why deleting the existing data and building it again is necessary. Here's why: When you change your schema, nothing happens to the existing data in the index. When Solr tries to access the existing data in the index, it uses the schema as a guide to interpreting that data. If the index contains documents that have a field built with the SortableIntField class and then Solr tries to access that data with a different class (such as TrieIntField), there's a good chance that an unrecoverable error will occur.

Using Solr as a Data Source

Don't do this unless you have no other option. Solr is not really designed for this role. Every attempt is made to ensure that Solr is stable, but indexes do get corrupted by unanticipated situations and by things completely outside developer control. Solr 4.x and later versions do have NoSQL features, and SolrCloud goes a long way towards high availability, but absolute data reliability in the face of any problem is difficult for any software to achieve, which is why it's always important to have backups.

(warning) Using Solr as a data source to build a new index is only possible if your index meets the explicit requirements for the Atomic Update feature. If some of your fields don't meet these criteria, you won't be able to recover that data. It's simply not possible. The advice about copyFields is particularly important, because you could lose data there or end up with data that's included in the index multiple times.

If you absolutely must use one Solr index as the data source for another index, and you have stored every field except those that shouldn't be stored, you have a few possible options:

  1. Use the dataimport handler with SolrEntityProcessor.
  2. Export the data using Solr queries, then reimport it after making sure it's in the correct format. You could use XML or CSV for this. This is not a trivial process. There is no process or program available from the Solr project for doing this. Here are some possible ideas:
    1. http://grokbase.com/t/lucene/solr-user/134p562kxs/export-index-and-re-index-xml
    2. http://www.jason-palmer.com/2011/05/how-to-reindex-a-solr-database/
    3. Recent versions of Solr have added a new export capability – the /export handler. This might prove useful.
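
As one possible shape for the "export the data using Solr queries" option, here is a Python sketch that pages through every document with cursorMark deep paging. It assumes a Solr version that supports cursorMark (4.7 or later) and a uniqueKey field named id; the function names and the select URL are hypothetical. Only stored fields come back, so this does not get around the requirements described above.

```python
import json
from typing import Any, Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

FetchPage = Callable[[Dict[str, Any]], Dict[str, Any]]


def export_all_docs(fetch_page: FetchPage, unique_key: str = "id",
                    rows: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield every document by following cursorMark until it stops changing."""
    cursor = "*"
    while True:
        result = fetch_page({
            "q": "*:*",
            "sort": f"{unique_key} asc",  # cursorMark needs a sort on the uniqueKey
            "rows": rows,
            "cursorMark": cursor,
            "wt": "json",
        })
        for doc in result["response"]["docs"]:
            yield doc
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:  # same cursor returned: nothing left to read
            break
        cursor = next_cursor


def http_fetch_page(select_url: str, timeout: float = 60.0) -> FetchPage:
    """Return a fetch function that queries a /select URL over HTTP."""
    def fetch(params: Dict[str, Any]) -> Dict[str, Any]:
        with urlopen(f"{select_url}?{urlencode(params)}", timeout=timeout) as response:
            return json.load(response)
    return fetch
```

For example, export_all_docs(http_fetch_page("http://localhost:8983/solr/mycore/select")) would iterate over the documents of a core named mycore. Converting them into a format suitable for reimport (dropping copyField destinations and fields such as _version_) is still up to you.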

Alternatives when a traditional reindex isn't possible

Sometimes the option of "do your indexing again" is difficult. Perhaps the original data is very slow to access, or it may be difficult to get in the first place.

Here's where we go against the advice we just gave you. Above we said "don't use Solr itself as a data source" ... but one way to deal with data availability problems is to set up a completely separate Solr instance (not distributed, which for SolrCloud means numShards=1) whose only job is to store the data, then use the SolrEntityProcessor in the DataImportHandler to index from that instance to your real Solr install. If you need to reindex, just run the import again on your real installation. Your schema for the intermediate Solr install would have stored="true" and indexed="false" for all fields, and would only use basic types like int, long, and string. It would not have any copyFields.
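
To show roughly what the DataImportHandler side of this setup looks like, the Python sketch below generates a minimal data-config document with a single SolrEntityProcessor entity. The entity name, the source URL, and the function name are assumptions for the example; a real configuration may need more attributes (such as fl or fq) depending on your data.

```python
import xml.etree.ElementTree as ET


def solr_entity_config(source_url: str, query: str = "*:*", rows: int = 500) -> str:
    """Render a DIH config that reads documents from another Solr core."""
    root = ET.Element("dataConfig")
    document = ET.SubElement(root, "document")
    ET.SubElement(document, "entity", {
        "name": "storage",                   # hypothetical entity name
        "processor": "SolrEntityProcessor",
        "url": source_url,                   # the storage-only Solr core
        "query": query,                      # which documents to pull
        "rows": str(rows),                   # documents fetched per request
    })
    return ET.tostring(root, encoding="unicode")
```

The returned XML would be saved as the dataimport handler's config file on the real (search) installation, for example with solr_entity_config("http://storage-host:8983/solr/storage"), where the host and core names are placeholders.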

This is the approach used by the Smithsonian for their Solr installation, because getting access to the source databases for the individual entities within the organization is very difficult. This way they can reindex the online Solr at any time without having to get special permission from all those entities. When they index new content, it goes into a copy of Solr configured for storage only, not in-depth searching. Their main Solr instance uses SolrEntityProcessor to import from the intermediate Solr servers, so they can always reindex.

Note that if you're not already using Solr as a data source, then you'll have to reindex *twice* in order to utilize that method: once to your intermediate Solr server(s), then from there to the server(s) that you're using for search.

How long does it take?

A full reindex is going to take AT LEAST as long as the initial indexing took. Unless you delete the index entirely before beginning, it could take even longer.
