Internal Data Structure Development
Any data structure developed in Blur needs to have a manageable memory footprint. The easiest way to achieve this is to make the data structure file based (through the Lucene Directory API). A file based data structure will make use of the block cache directory, which automatically caches blocks of the files in use.
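As a plain-Java sketch of the principle (the class and method names here are hypothetical, not Blur's actual API; in Blur the reads and writes would go through the Lucene Directory API so the block cache directory can serve the hot blocks), a file backed long array keeps only a small, fixed buffer on the heap no matter how many values it stores:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a long array whose values live in a file, not on the heap.
// In Blur this would be written against the Lucene Directory API (IndexOutput /
// IndexInput) so the block cache could transparently cache the hot blocks.
public class FileBackedLongArray implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buf = ByteBuffer.allocate(Long.BYTES); // fixed heap cost

    public FileBackedLongArray(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    public void set(long index, long value) throws IOException {
        buf.clear();
        buf.putLong(value).flip();
        channel.write(buf, index * Long.BYTES);
    }

    public long get(long index) throws IOException {
        buf.clear();
        channel.read(buf, index * Long.BYTES);
        buf.flip();
        return buf.getLong();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```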
If a data structure cannot be written to use the file based API, then a manageable memory model will have to be implemented instead.
The reasoning for this development strategy is threefold:
- JVM Heap limitations
- Data growth issues
- User query requirements
JVM Heap limitations
The two main limitations are garbage collection and overall size. The overall size of the heap is currently limited to around 16 GB (assuming that you are NOT using Zing or an Azul appliance). There are many, many blogs discussing the limitations of the GC and the JVM heap.
Data growth issues
A goal of most clusters is to have enough RAM to hold the hot portions of the index in memory. However, in some situations it may be necessary to load more data into a system than is recommended. This will cause the caching system to miss more often, but the system will continue to operate. If the same situation occurs with a naive, fully heap loaded data structure, the cluster could fail with the usual "Out Of Memory" exceptions.
User query requirements
User queries for the most part are short lived and require minimal amounts of heap space; the big exception is sorting. Sorted queries require the ordering field(s) to be loaded into memory. Many improvements have been made in Lucene 4 when it comes to field caching, but the default implementation loads the entire field contents onto the heap. In addition to the on heap version, Lucene offers a separate implementation that reads the field contents from files (through the Directory API); this is the implementation Blur should use to perform sorting.
NOTE: These features in Lucene 4.0 are called Column Stride Fields.
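In practice Blur would lean on Lucene's own file based field implementation; as a rough plain-Java illustration of why this bounds heap use (all names below are hypothetical), the comparator reads each document's ordering value from a file on demand instead of materializing the whole field array on the heap:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of file based sorting: the per-document ordering values
// live in a file (one fixed-width long per document) and are read on demand,
// so heap cost stays constant no matter how large the field is. Lucene 4's
// file based field implementation applies the same idea behind the Directory API.
public class DiskSort {
    public static Integer[] sortDocsByField(Path valuesFile, int docCount) throws IOException {
        try (FileChannel ch = FileChannel.open(valuesFile, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(Long.BYTES); // fixed heap cost
            Comparator<Integer> byFieldValue = Comparator.comparingLong(doc -> {
                try {
                    buf.clear();
                    ch.read(buf, (long) doc * Long.BYTES); // seek to this doc's value
                    buf.flip();
                    return buf.getLong();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            Integer[] docs = new Integer[docCount];
            for (int i = 0; i < docCount; i++) docs[i] = i;
            Arrays.sort(docs, byFieldValue);   // doc ids ordered by field value
            return docs;
        }
    }
}
```

The trade-off is one disk seek per comparison instead of one large heap allocation per field, which is exactly what the block cache is designed to absorb.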
The next largest memory consumer for user queries is filter caching. For the most part this is accomplished through weakly referenced bit sets that represent the filter the user requested. A file based solution has not yet been implemented, but should be.
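A minimal sketch of the weakly referenced pattern described above (the class and method names are hypothetical, not Blur's actual API): the cache hands back a previously computed bit set while something still references it, but lets the GC reclaim entries under memory pressure rather than fail with an out of memory error:

```java
import java.lang.ref.WeakReference;
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a weakly referenced filter cache: each filter's bit
// set (one bit per document matching the filter) is held via a WeakReference,
// so the GC may reclaim it under memory pressure and the filter is simply
// recomputed on the next request.
public class FilterCache {
    private final Map<String, WeakReference<BitSet>> cache = new ConcurrentHashMap<>();

    public BitSet getOrCompute(String filterKey, Supplier<BitSet> compute) {
        WeakReference<BitSet> ref = cache.get(filterKey);
        BitSet bits = (ref == null) ? null : ref.get();
        if (bits == null) {                       // missing or already collected
            bits = compute.get();                 // recompute the filter's bit set
            cache.put(filterKey, new WeakReference<>(bits));
        }
        return bits;
    }
}
```

A file based variant would spill these bit sets through the Directory API instead, trading recomputation cost for cached disk reads.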