Google Summer of Code 2006 – Project Proposal
Implement an LRU-based cache manager for Derby
- Karthik Thyagaraja
- karthikt at ufl dot edu
Øystein Grøvlen (oystein.grovlen at sun dot com)
A good page replacement algorithm is very important for a Database Cache Manager.It decreases the response time & increases the speed with which results are delivered.The LRU based cache manager will try to replace the currently used Clock based Cache Manager in Derby.The clock algorithm has sub-optimal performance in many cases.The new cache manager aims to give a good performance(almost optimal) under most of the situations.Having a good buffer manger will lower the number of Disk I/O's & thus lead to improvement in the overall system performance.This would be a valuable addition to the Apache Derby Database.
An important part of any database system is the buffer/cache manager.It is the interface between main memory and disk.The database buffer is divided into equal sizedpages so that data can be exchanged between the disk and the main memory.Typically the page size & the buffer size are parameters in the catalog that remain fixed once the database starts running.To change this parameter the catalog has to be updated and the database has to be restarted.Since a physical access to a page on disk is more expensive than an access to a page in the cache, the aim of the database buffer manager is to minimize the number of Disk I/O's incurred. When a page is requested from the buffer the following things have to be done:
- Search if the page requested already exists in the buffer
- If it exists, then service the request from this page.
If it doesn't exist(This is called a page fault),then fetch it from the disk & add it into the buffer.If the buffer is completely filled then some of the pages have to be replaced in order to accommodate the new request.The decision of:Which page to be replaced? is taken by the page replacement algorithm.
So page replacement algorithm plays a crucial role in the performance of the cache manager.An ideal replacement algorithm has least number of page-faults.The various types of page replacement algorithms are: NRU(Not-Recently-Used), FIFO(First-in-First-Out), Clock Algorithm, LRU(Least Recently Used). Currently Derby uses Clock algorithm for its Cache Manger.Clock Algorithms are more suitable for handling virtual memory in operating systems than for database buffers because in a database cache the amount of work associated with each page access is much larger.Clock algorithms don't perform optimally in many situations.LRU(Least recently algorithm) policy has been proven to be a better algorithm for page replacement.It has been widely used & studied and is amenable to statistical analysis.It has been proved that LRU can never result in more than N-times more page faults than theoretically optimal page replacement algorithm ( N is proportional to the number of pages in the buffer).
Naive LRU Replacement Algorithm: It works on the principle of locality.In this approach each page is assigned a time stamp or some other time based ID.On a page fault the page which hasn't been accessed for the longest time is replaced.Some drawbacks of this naive method is that :It is unable to differentiate between pages that have relatively frequent reference and pages that have very infrequent reference.Once a page has been read into the buffer it stays for a long time in the buffer even if it isn't referenced anymore. LRU-K algorithm: It is an enhancement to the LRU algorithm.It takes into account the access frequency & history of each page in order to make better decisions while replacing a page. Advantages of this policy are:
- It better distinguishes between page sets with different levels of reference frequency
- It adapts well to the changing access patterns
- It is simple and doesn't need lot of bookkeeping
- "It detects locality of reference within query executions,across multiple queries in the same transaction, and also locality across multiple transactions in a multiuser environment."(Elizabeth et.al.)
This technique is discussed in detail in this paper: http://citeseer.ist.psu.edu/16869.html
Issues to be considered while designing:
- Concurrent transactions
- The type of page:Ex Index page or data page
- Type of file organization: Sorted,Heap or B-tree
- Type of query algorithm(Ex Some types of Joins may need to be handled differently)
In this project i aim to:
- Design and implement a LRU-based Cache Manager for Derby.It will use the LRU-K algorithm.
- Develop test cases(Unit testing)
Analyze the performance of the cache & compare it with that of the clock algorithm
Some implementation details
The project will be implemented in Java. The cache manager can be implemented as a doubly-linked list of memory pages.We will have two pools:An Used and an Unused pool of pages.This way it is easy to service any request for a page.
An AVL Tree can be used to store the details of each page like it's timestamp,reference to the page etc.An AVL tree is more suitable because it has lookup, insertion, and deletion operations of O(log n) in both the average and worst cases.
Project Timeline and Deliverables
Understand the overall system architecture & understand Derby Code as appropriate for the cache manager
- Come up with a Design for the Cache Manager
- Final design ready for implementation after complete approval of mentor
-May 30-June 30
- Implement LRU algorithm
-June 30-July 10
- Unit tests
Performance evaluation & comparison with the previous Clock algorithm
-July 20 onwards
Additional enhancements for different types of loads like Nested Loops Join,assigning different priorities depending on the type of the page.(Ex. A B-tree root node should have a very high priority & should not be removed from the cache)
-Documentation -August 19th
- Final Project submission to Google
I am a second year Masters student(MS in Computer Science and Engineering) at Univeristy of Florida(UFL),Gainesville,USA.I had taken a Database Implementation course(Grad Level Course) in which i learnt in detail the workings of a typical relational database system.I have also designed and implemented a single user relational DBMS.I single handedly coded about 11K lines of C++ code for the DBMS.The components of the DBMS are Buffer Manager,Sorted & B-tree file organizations,Query Compiler,Query Optimizer & Query Execution engine.It supports Selection,Projection,Join,Multiply,Remove Duplicates and Aggregate Functions.
Some significant highlights of the project were -
I used gprof to speed up the external sorting(Two-phase,Multiway Merge-sort).I reduced the time it took to sort by 50%.I accomplished this by bypassing object comparisons & instead doing direct memory comparisons.
Used Dynamic programming to speed up & lessen the memory usage of the brute-force "Enumerate All Possible Logical Query Plans" algorithm.
I have about 3 years of good Java experience.I have coded most of my academic projects in Java.Some big projects that i have built using Java are:
- An Interpreter for a functional language called RPAL
A Hierarchical Group Communication System(extensive use of threads & dealt with issues like concurrency)
- Implemented a modified 7-stage pipeline simulator for reduced MIPS instruction set architecture.
I am very excited about this project.Having already built a single user DBMS and having implemented a buffer manager that makes use of LRU algorithm, i am very confident that i will be able to do an excellent job on this project.A chance to contribute to the Open Source community makes me feel elated.
I am just taking 3 research credits this semester.So i will be free most of this summer.
Principles of Database Buffer management http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=2022
Database Systems - The Complete Book by Hector Garcia Molina et.al.