Implementation of a LRU cache manager for Apache Derby
This is a Google Summer of Code project proposal. The goal of the project is to implement, test and integrate a cache manager for Apache Derby which uses the Least Recently Used principle when evicting data blocks from the cache.
- Anders Morken
Derby is a full-feature relational database management system implemented in Java. Currently its cache manager uses a clock-based algorithm when evicting pages from the cache. The primary goal of this project is to build a cache manager using a LRU eviction policy which can be dropped into Derby instead of it's existing clock cache manager. This page has more information from the mentor.
- Possible performance improvements - the clock algorithm is more commonly used in virtual memory subsystems in operating systems due to its low per-access overhead. However, most database systems use LRU-based cache managers because they provide a more intelligent page eviction choice at a cost which isn't unreasonable given the already high cost of page access. A successful LRU cache implementation which correctly handles sequential scans might prove better than the clock algorithm.
- Demonstration of Derby's modular architecture - Derby is designed and implemented as a set of modules with defined interfaces. An internal API interface can have multiple implementations, and Derby's Monitor subsystem can choose between them at Derby boot time.
- Configurability - some workloads may be suitable for a LRU cache, others for a clock cache. Derby's choice of a cache manager could be configurable if this is desired.
Plan of approach
- As there is a known good Clock cache manager in Derby already, work on the LRU cache manager will start from that. Clock's bookkeeping and eviction logic will be ripped out and replaced with the needed logic for a LRU cache.
- Specifically, instead of a "recently used" flag which is cleared by a periodic walk through the cache, the cache items will maintain a linked list of themselves in "order of last use".
- Initially, the implementation will be rather naive. No attempts at identifying special access patterns (such as sequential scans) will be made. As Derby's cache manager is used by more than just the page cache, access pattern detection might or might not be desirable.
- A look at other database systems' buffer management algorithms could yield useful insight into what direction Derby should go.
- Code in Derby that performs sequential scans will be identified and investigated, and a decision on how such scans should be handled will be made with assistance from the mentor and other project participants. Possibilities include bypassing the cache entirely on misses, or extending the cache API to allow the cache user to specify a lower or higher priority for different types of data. Making the cache manager itself identify seq. scans is probably difficult and error-prone, and we already have perfect knowledge of that in Derby code outside the cache manager.
Instead of making a home-made extension to LRU with priorities, a buffer management algorithm which is resistant to scanning behavior, such as "2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm." could be investigated. It doesn't seem to be overly complex or expensive, but is specifically designed for buffer management, and may not be the right choice for other caches such as the statement cache.
- In light of the constraints detailed below, this is a tentative schedule.
- May 23rd: Tentative project start if accepted by the Apache and the Summer of Code program.
- June 5th: Initial delivery of preliminary patch with a simple LRU-ified version of existing cache manager, for review. derbyall test suite should pass using this cache. Deadline probably achievable if work is started ahead of SoC official start.
- June 26th: Delivery of improved LRU cache patch, closer to production quality. Should be able to replace existing cache manager and pass derbyall test suite with flying colours.
- July 3rd: Interesting performance measurements and scenarios identified, tests implemented.
- July 8th: Delivery of performance evaluation of different caches.
- July 17th: Improved cache manager patch taking lessons learned from performance testing into account delivered.
- August 3rd: Delivery of investigation of sequential scan handling, possibly patches.
- LRU-based Cache Manager patch for Derby - including unit tests.
- Performance reports.
- Investigation of bad performance cases, i.e. seq. scans.
- Possibly: Patches to other parts of Derby if required to handle seq. scans. TBD.
- Project starts: May 23rd, ends: August 21st.
- All of Anders' exams this semester are scheduled after the SoC program start on May 23rd, the last on June 10th. This limits productivity in the start of the program.
- Anders may or may not have a dayjob in the summer.
- The mentor is probably on vacation in July.
Anders and his girlfriend need to visit each other's parents for two weeks in August.
- Started playing with Derby around christmas 2005, submitting minor patches and getting acquainted with the code base and community.
- Has a large school project on Derby and SMP scalability scheduled for the next semester.
Works part time as systems administrator at the university, among other things as firstname.lastname@example.org.
Interested in databases, performance and scalability challenges, hardware/software design. And skydiving.