Guidance on Using Filters for Increased Efficiency

(The following is mainly reused content from the thread starting at Re: fq vs. q - Michael Ludwig on the org.apache.lucene.solr-user mailing list.)

What goes in the `q` parameter and what goes in the `fq` parameter?

How do I decide, when writing a query, which criteria go in the q parameter and which go in the fq parameter to achieve optimal performance? Is there some kind of rule of thumb to help me decide how to split things up when querying against one or more fields?

Understanding Solr's caching system

First, some background: it is necessary to understand SolrCaching, notably the queryResultCache and the filterCache. You should be familiar with the different caches Solr has in order to make informed decisions on filter usage.

You will then know that the first occurrence of a filter query results in a filter (implemented as a bit vector) being cached within Solr. A separate filter cache entry is made for each fq argument in your query. Computing the filter for an fq that is not yet cached involves a complete search of the index.

Now, what is a filter query? It is simply a part of a query that is factored out for special treatment. This is achieved in Solr by specifying it using the fq (filter query) parameter instead of the q (main query) parameter. The same result could be achieved leaving that query part in the main query. The difference will be in query efficiency. That's because the result of a filter query is cached and then used to filter a primary query result using set intersection.
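The mechanism can be pictured with a small Python sketch. It is a toy model under stated assumptions, not Solr's implementation: the four-document index, the `search` and `get_filter` functions and the `filter_cache` dict are all invented for illustration, and document sets are plain Python sets rather than bit vectors.

```python
# Toy model of q/fq evaluation; not Solr code. All names are hypothetical.

# A tiny "index": document id -> field values.
INDEX = {
    0: {"bla": "eins", "text": "red apple"},
    1: {"bla": "zwei", "text": "green apple"},
    2: {"bla": "eins", "text": "red car"},
    3: {"bla": "eins", "text": "green car"},
}

filter_cache = {}   # fq string -> frozenset of matching doc ids
index_scans = 0     # counts how often a filter had to be computed


def match_field(field, value):
    """Return the ids of documents whose field contains the given word."""
    return frozenset(
        doc_id for doc_id, doc in INDEX.items()
        if value in doc[field].split()
    )


def get_filter(fq):
    """Look the filter up in the cache; compute and store it on a miss."""
    global index_scans
    if fq not in filter_cache:
        field, value = fq.split(":", 1)
        filter_cache[fq] = match_field(field, value)
        index_scans += 1
    return filter_cache[fq]


def search(q, fqs=()):
    """Evaluate the main query, then intersect it with each cached filter."""
    field, value = q.split(":", 1)
    result = set(match_field(field, value))
    for fq in fqs:
        result &= get_filter(fq)
    return sorted(result)


print(search("text:red", ["bla:eins"]))    # [0, 2]; the filter is computed here
print(search("text:green", ["bla:eins"]))  # [3]; the filter comes from the cache
print(index_scans)                         # 1
```

Both searches use different main queries, yet the filter for bla:eins is computed only once and then reused.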

So how can we put this to practical use? Well, we have to decompose our queries and take note of the frequency of each query part. If we know how often certain query parts arise, or at least have the means to collect that data, we know what might be candidates for filtering.

Thinking about our queries analytically

Now how do we know what query parts recur frequently?

Well, we know the application we're writing. Either we know the frequency of a given query part from the way our application uses Solr and from the restrictions it imposes on the user (say, by using DisMaxRequestHandler), or, if we give the user fine-grained control over the query language, we can collect and analyze the actual queries to determine search engine usage and query part frequency empirically, and optimize accordingly.

Anyway, we need to analyze, to decompose the queries we want our system to handle. We'll then know the query parts and the frequency of the various combinations, and we'll then see what are good filter query candidates.

Analysis, Scoring and Faceting

Each field type in Solr can define an index-time and a query-time analyzer. When a query is specified through the 'q' parameter, it is parsed and its tokens are analyzed through the query-time analyzer for that field's type. For example, 'string' type fields are not analyzed but 'text' type fields are analyzed. However, any query specified as a filter query through the 'fq' parameter is *not* analyzed regardless of the field's type. Therefore, filter queries are best suited for filtering on exact matches or range searches. If you need to filter but the search query needs to be analyzed, it should be specified through the 'q' parameter.

Filter queries, as the name suggests, are used for filtering (or drilling down into) the result set. The results of a filter therefore do not need to be ordered, and so a filter does not participate in scoring. Queries specified in the 'q' parameter contribute to the scores.

Filter queries are great for filtering result sets based on facets. Faceting is performed on the indexed tokens and therefore the value of any facet can be directly used in the 'fq' parameter since it does not need to be analyzed.

An example to illustrate the greater efficiency obtainable by filtering

Filtering a given query result R on bla:eins, bla:zwei, bla:drei or bla:vier is very common in my application. So while I could include this criterion in my main query (q) and hope for the queryResultCache to kick in, this would likely be inefficient as my primary query, which gave me R, likely varies a lot, resulting in a high number of distinct queries, with relatively low probability for a given query to occur frequently. So each of these query result sets would enter the queryResultCache as a distinct set, hence high contention, high eviction rate, poor cache efficiency.

Enter filter queries. I'm going to factor out those bla:eins (etc) filters from my primary query (q) and put them in the filter query (fq). The benefit is double:

(1) Solr has a dedicated cache for filters, and I control its use through my use of the filter query (fq). I can set things up so that the primary query (q) is under the user's control while the filter query (fq) is under my application's control. I keep this cache efficient by allowing only frequently used filters to enter it, and by not admitting so many filters that high contention and eviction in the filterCache would ensue.

(2) Factoring out the filter query bla:eins (etc) from the primary query also reduces variation in the primary query, thus making the queryResultCache more efficient.

So instead of having, say, 10000 distinct primary queries, no usage of the filterCache, and poor usage of the queryResultCache, I may have only, say, 3000 distinct primary queries, four cached filters in the filterCache (bla:eins etc), and a somewhat better usage of the queryResultCache.
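The arithmetic behind this kind of reduction can be sketched in Python. The workload is invented for illustration (2500 made-up user queries, each combined with one of the four bla filters), so the counts differ from the "say" figures above; the sketch only counts distinct query strings and makes no claim about how Solr builds its cache keys.

```python
# Hypothetical workload: every user query is combined with one of four filters.
from itertools import product

user_queries = [f"text:term{i}" for i in range(2500)]   # made-up user input
filters = ["bla:eins", "bla:zwei", "bla:drei", "bla:vier"]

# Variant A: the filter is folded into q, so each combination is a distinct string.
folded = {f"{q} AND {fq}" for q, fq in product(user_queries, filters)}

# Variant B: the filter is factored out into fq.
q_strings = set()    # distinct main query strings
fq_strings = set()   # distinct filter query strings
for q, fq in product(user_queries, filters):
    q_strings.add(q)
    fq_strings.add(fq)

print(len(folded))                       # 10000
print(len(q_strings), len(fq_strings))   # 2500 4
```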

Stray bits

Memory consumption per filter field value is not a great concern here as the filterCache stores relatively compact representations such as bit vectors (or smaller representations, depending on the set cardinality) with each bit representing a boolean to signal whether or not the document in question is a member of the set matching the filter specification. Document reference is implicit by each bit's position in the vector; this is by virtue of the fact that Solr's internal document IDs (which are different from the document IDs the user may assign via the <uniqueKey> in schema.xml) are a sequence of consecutive integers.

If my filter query result comprises more than 50% of the entire document collection, its selectivity is poor. I might need it despite this fact, but it might also be worthwhile to think about how to reframe the requirement to allow for more efficient filters.
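The two points above, one bit per document and the share of the collection a filter matches, can be made concrete with a toy bit vector in Python. The class `BitVectorFilter` and its methods are hypothetical names for this sketch, not Solr's actual classes, and the collection size of one million documents is an assumption.

```python
# Toy bit vector keyed by internal document id; not Solr's actual classes.

class BitVectorFilter:
    def __init__(self, max_doc):
        self.max_doc = max_doc
        self.bits = bytearray((max_doc + 7) // 8)   # one bit per document

    def add(self, doc_id):
        """Mark the document at this position as matching the filter."""
        self.bits[doc_id // 8] |= 1 << (doc_id % 8)

    def __contains__(self, doc_id):
        """The document is identified only by the position of its bit."""
        return bool(self.bits[doc_id // 8] & (1 << (doc_id % 8)))

    def cardinality(self):
        """Number of documents matching the filter."""
        return sum(bin(byte).count("1") for byte in self.bits)

    def fraction_matched(self):
        """Share of the whole collection that matches the filter."""
        return self.cardinality() / self.max_doc


f = BitVectorFilter(max_doc=1_000_000)
for doc_id in range(0, 1_000_000, 4):   # pretend every fourth document matches
    f.add(doc_id)

print(len(f.bits))            # 125000 (bytes for one million documents)
print(8 in f, 9 in f)         # True False
print(f.fraction_matched())   # 0.25
```

In this model a filter over a million documents costs about 122 KiB regardless of which field value it represents, and a filter whose matched fraction exceeds 0.5 would fall under the poor selectivity case described above.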

What varies heavily should probably not go into the filterCache. For example, a geodata search window (longitude and latitude) varying over a huge valuespace with each user action looks like a candidate for the main query (q), whereas some other criterion not subject to such frequent change and relatively limited in its valuespace looks like a candidate for the filter query (fq).

How does the facet.query parameter fit in with the q and fq parameters? It is also some kind of query, isn't it? That's true; but as the name suggests, it is for faceting, not for filtering. A facet.query is just a more flexible means of defining a filter than is possible using a mere facet.field. While q and fq affect the results portion of a search response, facet.query only affects the facets portion of a response. A facet.query is only used where you want a facet summary of your query based on some kind of complex expression rather than on the terms within a single field, as with facet.field. It is not used for filtering the result set, but to obtain faceting data in addition to the result set.
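As an illustration of how the three parameters sit side by side in one request, here is a sketch that builds a query string with Python's standard library. The field names text and price, the range expressions and the /solr/select path are assumptions chosen for the example; adjust them to your own schema and deployment.

```python
from urllib.parse import urlencode, parse_qs

# A list of pairs is used so that facet.query can be repeated.
params = [
    ("q", "text:apple"),                   # main query: scored, shapes the results
    ("fq", "bla:eins"),                    # filter query: restricts the results, not scored
    ("facet", "true"),
    ("facet.query", "price:[0 TO 100]"),   # counted in the facets portion only
    ("facet.query", "price:[100 TO *]"),   # does not restrict the result set
]

query_string = urlencode(params)
print("/solr/select?" + query_string)

# Round trip to check that the repeated parameter survived encoding.
decoded = parse_qs(query_string)
print(decoded["facet.query"])
```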

Configuring the `filterCache`

If I know that only 100 filters are possible, there is no point raising the filterCache/@size above that threshold. But it may not be harmful either.

Given the following three filtering scenarios of (a) x:bla, (b) y:blub, and (c) x:bla AND y:blub, will I end up with two or three distinct filters? In other words, may filters be composites or are they decomposed as far as their number (relevant for filterCache/@size) is concerned? In this example, (a), (b) and (c) are three distinct filters. If, however, (c) were specified using two distinct fq parameters, x:bla and y:blub, I'd end up with only two distinct filters for (a), (b) and (c).
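The counting can be sketched in a few lines of Python. The sketch assumes, as a simplification, that one cache entry exists per distinct fq string; the function `filter_cache_keys` is a hypothetical helper, not part of Solr.

```python
def filter_cache_keys(requests):
    """Collect the distinct fq strings seen across a list of requests.

    Each request is given as the list of its fq parameter values.
    """
    keys = set()
    for fqs in requests:
        keys.update(fqs)
    return keys


# Scenario (c) sent as a single composite fq.
composite = [["x:bla"], ["y:blub"], ["x:bla AND y:blub"]]

# Scenario (c) sent as two separate fq parameters.
separate = [["x:bla"], ["y:blub"], ["x:bla", "y:blub"]]

print(len(filter_cache_keys(composite)))   # 3
print(len(filter_cache_keys(separate)))    # 2
```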

Note that faceting by facet queries (but not by facet fields) also uses filters under the hood. In fact, facet queries are just an application of filtering. So take that information into account when thinking about what could be a good value for filterCache/@size.

What happens when the filterCache is full?

What happens when the filterCache is full? Is there any accounting of which cache entries are getting the most or the most recent hits? A good question, which remains to be answered.

Further reading

More information on filters (and other topics) can be gleaned from a lengthy, but very good article on scaling Lucene and Solr.

FilterQueryGuidance (last edited 2009-09-20 22:05:02 by localhost)