Google Summer of Code 2009 - Project Proposal

Project:

Adding Unicode Normalization support to Xerces2-J

Student Name:

Richard Kelly

Email:

rakkie at gmail dot com

Time Zone:

Australian Eastern Standard Time (UTC+10)

Abstract

This project will design and implement support for Unicode character normalization and normalization checking in Xerces. Applications that use Xerces will be able to produce fully normalized XML documents and verify that any XML documents they process are fully normalized. Documents that have been verified to be fully normalized can have string comparison operations performed on them without having to worry about the many possible forms (with the same meaning) allowed by Unicode.

Adding this functionality will allow Xerces to fully meet the XML 1.1 W3C Recommendation and allow it to implement the optional normalization checking features specified in the DOM Level 3 Core. This functionality will be implemented in two stages with testing and documentation supplied for each stage.

Scope

Initially this project will be limited to the Java version of Xerces, although it may be ported to other language versions at a later stage.

This version of the project will use the composition mappings from the Unicode Character Database. These mappings are guaranteed to stay the same in all Unicode releases after 3.1, but the implementation will not be compatible with versions prior to that.

Background

Unicode is a standard that represents most of the world’s writing systems. The latest version is currently 5.1.0 and contains over 100,000 characters 1. Unicode has been adopted by the XML standard, and all XML processors must support at least the Unicode character encodings UTF-8 and UTF-16.

Characters in Unicode may have several different representations that all have an equivalent meaning. Once characters have been fully normalized, string comparisons can be performed regardless of the representation originally used. Full details of Unicode normalization can be found in Unicode Standard Annex #15 2.
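As a small illustration of the problem, the sketch below compares two representations of the letter "é". It uses the JDK's java.text.Normalizer (available from Java 6) purely as a stand-in to show the behaviour; it is not the component proposed here, and the class name EquivalenceDemo is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; java.text.Normalizer is only a stand-in here.
class EquivalenceDemo {

    // Returns true when the two strings are equal once both are in NFC.
    static boolean equalAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT

        // The raw code point sequences differ, so a plain comparison fails.
        System.out.println(precomposed.equals(decomposed));         // false
        // After normalization both strings are the same sequence.
        System.out.println(equalAfterNfc(precomposed, decomposed)); // true
    }
}
```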

The XML 1.1 W3C Recommendation 3 states in section 2.13 that all XML parsed entities should be fully normalized. Currently Xerces does not perform any checks to make sure this occurs. This project aims to allow you to perform normalization checking when parsing documents with Xerces. In addition, it will implement the optional normalization features in the DOM parser as described in the Document Object Model Level 3 Core 4 and Load/Save 5 specifications.

Approach

This project requires two basic types of functionality:

  1. Character normalization - this takes a Unicode character as input and returns a fully normalized version (Normalization Form C) of that character as output.
  2. Normalization checking - this takes a Unicode string as input and returns true if the string is fully normalized or false if it is not.

Since most of the code for these two functions can be shared, it can be implemented as a single XNI component. This allows it to be easily plugged into the pipeline of any XML parser or called by the DOM parser when necessary. Changes to the parsers to utilize this component would include:

SAX Parsers:

  1. Define a Xerces-specific feature URI for normalization.
  2. Invoke the XNI component when the above feature is enabled and update the document with the normalized text.
  3. Invoke the XNI component to perform normalization checking when the "unicode-normalization-checking" 6 feature is enabled.
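From an application's point of view, switching on normalization checking could look roughly like the sketch below. It assumes the standard SAX2 feature URI for "unicode-normalization-checking" and only probes whether the parser accepts it; a parser without the proposed component is expected to refuse. The Xerces-specific normalization feature is not shown because its URI has not been defined yet. The class and method names are hypothetical.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical probe class, for illustration only.
class SaxFeatureProbe {

    // Standard SAX2 feature URI for normalization checking.
    static final String CHECKING_FEATURE =
            "http://xml.org/sax/features/unicode-normalization-checking";

    // Tries to switch a feature on; returns false if the parser refuses it.
    static boolean tryEnable(String featureUri) {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(featureUri, true);
            return true;
        } catch (ParserConfigurationException | SAXException e) {
            // SAXNotRecognizedException and SAXNotSupportedException
            // are both subclasses of SAXException.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("normalization checking accepted: "
                + tryEnable(CHECKING_FEATURE));
    }
}
```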

DOM Parser:

  1. Update the normalizeDocument() function to invoke the XNI component when the "normalize-characters" configuration flag is set to true, and update the DOM with the normalized text.
  2. When loading documents into the DOM with LSParser, invoke the XNI component to perform normalization checking, unless the document has already been certified (i.e. LSInput.certifiedText is true).
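The first DOM item can be sketched from the caller's side with the DOM Level 3 DOMConfiguration interface. The example below asks the implementation whether "normalize-characters" can be set before calling normalizeDocument(), so it also runs on implementations that do not support the parameter. The class and method names are hypothetical, and the LSParser case is not covered.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical probe class, for illustration only.
class DomNormalizeProbe {

    // Builds a tiny document whose text is 'e' + COMBINING ACUTE ACCENT.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(null, "root");
            root.appendChild(doc.createTextNode("e\u0301"));
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Sets "normalize-characters" when the implementation allows it, then
    // runs normalizeDocument(). Returns whether the parameter was accepted.
    static boolean normalizeIfSupported(Document doc) {
        DOMConfiguration config = doc.getDomConfig();
        boolean supported =
                config.canSetParameter("normalize-characters", Boolean.TRUE);
        if (supported) {
            config.setParameter("normalize-characters", Boolean.TRUE);
        }
        doc.normalizeDocument();
        return supported;
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        System.out.println("normalize-characters accepted: "
                + normalizeIfSupported(doc));
        System.out.println("text length after normalizeDocument(): "
                + doc.getDocumentElement().getTextContent().length());
    }
}
```

With an implementation that supports the parameter, the text length would be expected to drop from 2 to 1; otherwise the text is left unchanged.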

To convert a character to fully normalized form, or to test whether it already is in that form, three steps are needed:

  1. check if a character is already in fully normalized form or not
  2. if not, decompose the character into its constituent elements (a set of Unicode code points) - this is known as Normalization Form D (NFD).
  3. compose these constituent elements into a normalized character - this form is known as Normalization Form C (NFC).
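The three steps above can be sketched as follows. The sketch again borrows java.text.Normalizer as a stand-in for the decomposition and composition code this project would write itself, and it covers Normalization Form C only. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical sketch; the JDK normalizer stands in for the proposed code.
class ThreeStepSketch {

    static String toNfc(String input) {
        // Step 1: quick check, return early when nothing needs to change.
        if (Normalizer.isNormalized(input, Normalizer.Form.NFC)) {
            return input;
        }
        // Step 2: decompose into constituent code points (NFD).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 3: compose the constituents again (NFC).
        return Normalizer.normalize(decomposed, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String angstromSign = "\u212B"; // ANGSTROM SIGN
        String combining = "A\u030A";   // 'A' followed by COMBINING RING ABOVE
        String composed = "\u00C5";     // LATIN CAPITAL LETTER A WITH RING ABOVE

        // All three inputs end up as the same single code point U+00C5.
        System.out.println(toNfc(angstromSign).equals(composed)); // true
        System.out.println(toNfc(combining).equals(composed));    // true
        System.out.println(toNfc(composed).equals(composed));     // true
    }
}
```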

This process lends itself to being logically separated into two parts: decomposition and composition. The implementation of each part will form the basis of the milestones in this project.

To ensure this normalization checking is thoroughly tested, the Normalization Conformance Test included in the Unicode Character Database will be used. This test file will be converted into a set of unit tests to ensure that the XNI component fully conforms to the Unicode normalization standard.
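As a rough idea of how one line of the conformance file could become a unit test, the sketch below parses the five semicolon-separated columns of a data line and checks the NFC and NFD invariants stated in the file's header. java.text.Normalizer stands in for the component under test, the class and method names are hypothetical, and the NFKC/NFKD invariants, "@Part" headers and comment-only lines are left out.

```java
import java.text.Normalizer;

// Hypothetical sketch of a per-line conformance check.
class ConformanceLineCheck {

    // Converts a field such as "0044 0307" into the string it denotes.
    static String decodeField(String field) {
        StringBuilder sb = new StringBuilder();
        for (String hex : field.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(hex, 16));
        }
        return sb.toString();
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Checks the NFC and NFD invariants for one data line (columns c1..c5).
    static boolean checkLine(String line) {
        String data = line.split("#", 2)[0]; // drop the trailing comment
        String[] f = data.split(";");
        String c1 = decodeField(f[0]);
        String c2 = decodeField(f[1]);
        String c3 = decodeField(f[2]);
        String c4 = decodeField(f[3]);
        String c5 = decodeField(f[4]);

        boolean nfcOk = c2.equals(nfc(c1)) && c2.equals(nfc(c2))
                && c2.equals(nfc(c3))
                && c4.equals(nfc(c4)) && c4.equals(nfc(c5));
        boolean nfdOk = c3.equals(nfd(c1)) && c3.equals(nfd(c2))
                && c3.equals(nfd(c3))
                && c5.equals(nfd(c4)) && c5.equals(nfd(c5));
        return nfcOk && nfdOk;
    }

    public static void main(String[] args) {
        // Data line for U+1E0A, with the comment shortened to ASCII.
        String line = "1E0A;1E0A;0044 0307;1E0A;0044 0307; "
                + "# LATIN CAPITAL LETTER D WITH DOT ABOVE";
        System.out.println(checkLine(line)); // true
    }
}
```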

If time permits, various optimizations can also be implemented, such as multi-stage tables for quick lookups of mappings, algorithmic conversion for Hangul characters, and preprocessing to minimize calls to slower code paths.7
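To give an idea of the Hangul optimization, precomposed syllables can be decomposed arithmetically instead of through a lookup table. The sketch below follows the constants and arithmetic published in the Unicode Standard for Hangul syllable decomposition; the class and method names are hypothetical, and composition (the inverse mapping) is not shown.

```java
// Hypothetical sketch of algorithmic Hangul syllable decomposition.
class HangulSketch {

    static final int S_BASE = 0xAC00; // first precomposed syllable
    static final int L_BASE = 0x1100; // first leading consonant
    static final int V_BASE = 0x1161; // first vowel
    static final int T_BASE = 0x11A7; // one before the first trailing consonant
    static final int L_COUNT = 19;
    static final int V_COUNT = 21;
    static final int T_COUNT = 28;
    static final int N_COUNT = V_COUNT * T_COUNT; // 588
    static final int S_COUNT = L_COUNT * N_COUNT; // 11172

    static boolean isPrecomposedSyllable(int cp) {
        return cp >= S_BASE && cp < S_BASE + S_COUNT;
    }

    // Returns the canonical decomposition of a precomposed syllable;
    // any other code point is returned unchanged.
    static String decompose(int cp) {
        if (!isPrecomposedSyllable(cp)) {
            return new String(Character.toChars(cp));
        }
        int sIndex = cp - S_BASE;
        int l = L_BASE + sIndex / N_COUNT;
        int v = V_BASE + (sIndex % N_COUNT) / T_COUNT;
        int t = T_BASE + sIndex % T_COUNT;
        StringBuilder sb = new StringBuilder();
        sb.append((char) l).append((char) v);
        if (t != T_BASE) {
            sb.append((char) t); // only syllables with a trailing consonant
        }
        return sb.toString();
    }

    // Formats a string as space-separated hexadecimal code units.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(decompose(0xD55C))); // 1112 1161 11AB
        System.out.println(toHex(decompose(0xAC00))); // 1100 1161
    }
}
```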

Development Schedule

Before May 23rd

Participate in mailing list discussions, research, become familiar with the code base.

May 23rd to May 29th

Design and create necessary interfaces and structures, XNI component

May 30th to June 6th

Implement decomposition functionality (Normalization Form D)

June 7th to June 20th

Australian Exam Period *

June 20th to July 1st

Implement decomposition functionality (Normalization Form D) (continued)

July 2nd to July 5th

Testing, tidying up documentation, bug fixing

July 6th

MILESTONE – XNI component with Normalization Form D functionality complete and tested

July 7th to July 25th

Implement composition functionality (Normalization Form C)

July 26th to July 31st

Integrate XNI component into existing DOM parser

August 1st to August 5th

Create a set of unit tests based on the Normalization Conformance Test

August 6th to August 10th

Testing, tidying up documentation, bug fixing

August 10th

MILESTONE – Final code & documentation complete & tested

August 11th to August 17th

Overflow period in case any activities take longer than expected

* Unfortunately, the Australian end-of-semester exams fall within the Summer of Code period. This will take me out of action for two weeks (from the 5th of June to the 19th of June). To compensate for this, I will arrange with my mentor to either (a) start coding two weeks earlier or (b) work an extra 10 hours per week for the following 8 weeks. Other than this period I have no other commitments, so I am free to work on my code.

Deliverables

Community Interaction

I have joined the Xerces Developer mailing list and will use this for open discussion to gather suggestions for my project and to find solutions to any problems I encounter. In addition, I will report to my mentor every other day. I am also learning how to use the issue-tracking system (JIRA) to track and follow relevant issues. Most importantly, of course, I will do my best to evaluate and incorporate any suggestions from the feedback into my project.

About Me

Hi, my name is Richard. I’m currently studying for a Master’s in Information Technology at Monash University in Australia. I have two Bachelor’s degrees, one in IT and another in Arts focusing on Philosophy and Cognitive Psychology. I’m also (slowly) learning Korean and Chinese in my spare time.

Since I started learning other languages I’ve really begun to appreciate the usefulness of Unicode and I’ve been making all my own programs support it. One of my programs required me to use Unicode normalization to correctly process Korean input. I found this area fascinating and think that this project.

My studies are largely based on Java and I am currently studying web services so I am quite familiar with XML and DOM. While my knowledge of Xerces architecture is a little weak, I am attempting to address this by going through the design documents on the Xerces web site and by playing around with the source code.

I love to code and have been writing homebrew programs for a number of years. I use open source software wherever possible and think Google Summer of Code is a great idea. I would like to contribute code to something bigger and think that the Xerces project is a great match for me.