Tiago Espinha - Google Summer of Code Projects
Project: DERBY-728 - Unable to create databases whose name containg Chinese characters through the client driver
Student: Tiago A. R. Espinha
Mentor: Kathey Marsden
E-mail: tiago@espinhas.net (personal e-mail), tiago.derby@yahoo.co.uk (Derby matters)
IM: tiago@espinhas.net (Google Talk/MSN), etiago (IRC)
Google Summer of Code 2010
Abstract
Apache Derby relies on the open standard Distributed Relational Database Architecture (DRDA) to implement the abstraction between SQL and a standard DRDA language. Its implementation in Derby is currently limited to ASCII characters.
The community has requested support for Japanese and Chinese characters, and the need is current. My task will be to refactor and improve the code so that these characters are supported by the DRDA engine in Derby.
Application
Apache Derby is an RDBMS developed entirely in Java. Due to its small footprint, its flexibility and its ability to be deployed in a multitude of environments, Derby is one of the best open source alternatives currently available. The fact that it is the foundation for Sun’s Java DB clearly demonstrates this. As a passionate software developer with a strong educational background in database administration, I found that Derby caught my attention.
Last year I participated in and successfully completed a GSoC project with the Apache Software Foundation. It was an incredible and enriching experience that allowed me to connect with the community and get a better grasp of how it functions. My project consisted of creating unit tests (converting existing tests to a new standard) and helping with the ongoing bug fixing. From a technical point of view, this experience taught me a lot.
These past contributions to Derby are still available at [1].
I believe that my previous experience with Derby will help me succeed again this year. My mentor (Kathey Marsden) offered to mentor DERBY-728 [2] and suggested that I apply to this project: rolling in support for Chinese characters through the client driver [2][3].
Right now, when the client driver is used, requests are piped through the DRDA engine. In Derby’s implementation of DRDA, requests are encoded using EBCDIC [4], an 8-bit encoding that can represent at most 256 characters. This is enough for the US-ASCII characters (which EBCDIC can also represent), but it cannot encode the thousands of Chinese and Japanese characters. For these, we require a broader encoding such as UTF-8. Since backwards compatibility is always a concern, we must ensure not only that the new character encoding is put into place, but also that the older encoding remains supported.
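To illustrate the limitation, here is a minimal sketch that compares a single-byte encoding with UTF-8. It is not Derby code: the class and method names are made up, and ISO-8859-1 is used only as a stand-in for a single-byte encoding such as EBCDIC, because every JVM is guaranteed to provide it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: ISO-8859-1 stands in for a single-byte encoding such as
// EBCDIC. The class and method names are hypothetical, not Derby APIs.
final class SingleByteVsUtf8 {

    /** Returns true if the text survives an encode/decode round trip in the given charset. */
    static boolean roundTrips(String text, Charset charset) {
        // Characters the charset cannot map are replaced during encoding,
        // so the decoded string no longer matches the original.
        byte[] encoded = text.getBytes(charset);
        return text.equals(new String(encoded, charset));
    }

    public static void main(String[] args) {
        String ascii = "derbydb";
        String chinese = "\u6570\u636e\u5e93"; // three Chinese characters, written as escapes

        System.out.println(roundTrips(ascii, StandardCharsets.ISO_8859_1));   // true
        System.out.println(roundTrips(chinese, StandardCharsets.ISO_8859_1)); // false
        System.out.println(roundTrips(chinese, StandardCharsets.UTF_8));      // true
        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length);  // 9
    }
}
```

The single-byte charset silently loses the Chinese characters, while UTF-8 preserves them at the cost of three bytes per character in this example.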
There is currently an Architecture Change Request (ACR7007) [5] with The Open Group undergoing fast track review to make this change an actual component of the DRDA specification. This ACR proposes that an encoding be agreed upon at the EXCSAT stage between the Application Requester (AR) and the Application Server (AS). This encoding can be either the default EBCDIC or, for the added range of characters, UTF-8. The agreed encoding is then used for the commands following the ACCSEC (which is itself still negotiated using the normal EBCDIC encoding).
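A minimal sketch of how such a negotiation could be modelled is shown below. Everything in it is an assumption made for illustration: the enum, the method and the level numbers are invented here and are taken neither from Derby nor from ACR7007. It only captures the idea that UTF-8 is chosen when both sides support it and EBCDIC is kept otherwise.

```java
// Hypothetical sketch: the names and level numbers are invented for
// illustration and do not come from Derby or from the DRDA specification.
final class EncodingNegotiation {

    enum Encoding { EBCDIC, UTF8 }

    /** Lowest level at which a peer is assumed to understand UTF-8 (made-up value). */
    static final int UTF8_LEVEL = 1;

    /**
     * Picks the encoding for the commands that follow ACCSEC, based on the
     * levels the requester and the server announced during EXCSAT.
     */
    static Encoding negotiate(int requesterLevel, int serverLevel) {
        // The weaker of the two peers decides, which keeps old clients and servers working.
        int common = Math.min(requesterLevel, serverLevel);
        return common >= UTF8_LEVEL ? Encoding.UTF8 : Encoding.EBCDIC;
    }

    public static void main(String[] args) {
        System.out.println(negotiate(1, 1)); // UTF8
        System.out.println(negotiate(0, 1)); // EBCDIC (old requester)
        System.out.println(negotiate(1, 0)); // EBCDIC (old server)
    }
}
```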
In the meantime, I have set up my build environment and taken on a smaller task [6] that will help me build up to the main one. According to my mentor, this project would ideally be undertaken by someone with previous experience in contributing to Derby and, as such, I qualify for the task. Also, if I finish ahead of schedule, I will continue last year's project by assisting with the other issues.
I am still passionate about developing software and as a fresh graduate I can also use all the experience I can get. This program provides students with that experience and I am thrilled to be a part of it again.
At this point I am preparing to write the dissertation for my Master’s degree in Advanced Software Engineering and I have no other time-consuming commitments for the duration of the GSoC program. I am eager to take on this project and I look forward to working once more with the Apache Derby community.
Deliverables
- Incremental patches to roll in support for UTF-8 on the client driver
- New test cases to verify that characters beyond EBCDIC are indeed supported
Schedule
Section I - Problem Analysis
- 1st of April
- Studying the DRDA specification (chapter IV) to gain perspective on how it works.
- Studying the ACR7007 that specifies how the UTF-8 support should be integrated into the DRDA engine.
- 8th of April
- Fix DERBY-4584 as an ancillary issue related to DRDA and non-EBCDIC characters.
- Starting to study the existing prototypes for this implementation.
Section II - Implementation
- 30th of April
- By this date I expect to have an understanding of the prototype.
- 22nd of May
- Implement tests that will later verify whether the implementation has been done correctly.
- These tests will try to use UTF-8 characters in the RDBNAM, USRID and PASSWORD fields and fail whenever the server throws an error.
- Tests are also needed to ensure that the length cap is maintained on these fields. When only the EBCDIC implementation was available, 1 byte equalled 1 character. However, with UTF-8 in play, a character has variable length and can take up to 4 bytes. This means that the identifiers can no longer have 255 characters but only 255 bytes.
- 12th of June
- By this date the first draft of the structure needed for the UTF-8 encoding should be in place.
- This will require a Utf8CcsidManager that mediates the encoding from and to UCS2.
- It will also require changes wherever necessary to make sure that the CcsidManager is used.
- 3rd of July
- After the aforementioned task is done, I will be implementing a switch mechanism that will choose whether to use EBCDIC or UTF-8.
- This choice is based on the EXCSAT command and the level that has to be negotiated between the client (Application Requester) and the server (Application Server).
- The encoding used will be the lowest common denominator between these two, so as to maintain backwards compatibility.
- I will also need to enforce the 255 byte cap to satisfy the DRDA specification.
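The byte cap mentioned in this section can be sketched as follows. This is only an illustration of counting UTF-8 bytes instead of characters; the class and method names are hypothetical and are not Derby APIs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the class and method names are illustrative, not Derby APIs.
final class IdentifierCap {

    /** Cap in bytes, following the 255-byte limit described in the schedule. */
    static final int MAX_BYTES = 255;

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the identifier fits the cap when measured in UTF-8 bytes. */
    static boolean fitsCap(String identifier) {
        return utf8Length(identifier) <= MAX_BYTES;
    }

    /** Small helper that repeats a string, to build test identifiers. */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fitsCap(repeat("a", 255)));     // true: 255 one-byte characters
        System.out.println(fitsCap(repeat("\u6570", 85))); // true: 85 * 3 = 255 bytes
        System.out.println(fitsCap(repeat("\u6570", 86))); // false: 258 bytes
        System.out.println(utf8Length("\uD83D\uDE00"));    // 4: one character outside the BMP
    }
}
```

An identifier made of 86 Chinese characters is well under 255 characters but already exceeds the cap, which is why the tests need to measure bytes rather than characters.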
Section III - Testing and deployment
- 9th of August
- Check that backwards compatibility is maintained.
- This will entail testing the trunk server against a 10.1 client and a 10.1 server against a trunk client.
- Check that the identifier cap is in place.
- Check that UTF-8 characters are indeed supported.
References
[2] DERBY-728 (“Unable to create databases whose name containg Chinese characters through the client driver”)
[3] DERBY-4009 (“Accommodate length delimited DRDA strings where character length does not equal byte length”)
[4] EBCDIC Table
[5] ACR7007
[6] DERBY-4584 (“Unable to connect to network server if client thread name has Japanese characters”)
Google Summer of Code 2009
DerbyTestAndFix
Student: Tiago Espinha
Mentor: Kathey Marsden
ERROR XJ073: The data in this BLOB or CLOB is no longer available. should include the possibility that the lob has been freed | Done
Convert "org.apache.derbyTesting.functionTests.tests.store.holdCursorJDBC30.sql" to junit. | Done
Convert "org.apache.derbyTesting.functionTests.tests.store.holdCursorExternalSortJDBC30.sql" to junit. | Done
The javadoc for SpaceTable refers to an alias that doesn't seem to work | Done
Provide the ability to run tests concurrently on the same machine | In progress
Provide the ability to use properties with ij.runScript() | Done
Make the default port for the suites.All run configurable with a system property | Done
Convert derbynet/runtimeinfo to JUnit | Waiting commit
OFFSET and FETCH FIRST documentation improvement | Done
Side-work for the community:
Apache Derby screencast on how to set up the development environment