Google Summer of Code 2010 - Project Proposal

Project

Implementing a parser and an evaluator for Schema Component Designators

Student Name

Ishan Jayawardena

Email

udeshike@gmail.com

Time zone

UTC+5:30 (Sri Lanka)

Abstract

Apache Xerces2 is a high-performance, standards-compliant processor written in Java for parsing, validating, serializing and manipulating XML documents. The objective of this project is to implement a parser and an evaluator for schema component designators (SCDs) that can be used to identify and retrieve XML Schema components from the XML Schema data model used by Xerces. Schema components, which are defined in two W3C recommendations, XML Schema Part 1: Structures[#1] and XML Schema Part 2: Datatypes[#2], act as the building blocks of an XML schema document.

Description

W3C XML Schema Definition Language (XSD): Component Designators is a specification that reached W3C Candidate Recommendation status in January 2010[#3], with W3C inviting the community to start implementing it[#4]. The main advantage SCDs offer programmers is that they make it easier to navigate an XML Schema object model by reducing the amount of code needed to retrieve a specific set of schema components. This is achieved by using a path expression similar to an XPath expression. The W3C SCD specification defines two basic types of SCDs[#5]:

  1. Absolute SCDs (ASCD): An ASCD identifies a particular schema component; it consists of two parts: a designator for the assembled schema (a schema designator), and a designator for a particular schema component or schema components relative to that assembled schema (a relative schema component designator). Syntactically, an ASCD consists of a URI without a fragment identifier, which identifies the schema, and an XPointer fragment identifier, which encapsulates a schema component path (SCP)[#6] to designate a set of components in the context of that schema.
  2. Relative SCDs (RSCD): An RSCD identifies a particular schema component relative to some current assembled schema; it is expressed as an XPointer scheme xscd() that uses a schema component path as the scheme data. This XPointer scheme may be used in combination with the XPointer xmlns() scheme.

For instance, consider the ASCD http://example.org/schemas/po.xsd#xscd(/type::purchaseOrderType). Here, the URI http://example.org/schemas/po.xsd refers to an assembled schema, and the XPointer fragment xscd(/type::purchaseOrderType), which is an RSCD, refers to a particular schema component by using the SCP /type::purchaseOrderType. The following is an ASCD with a namespace binding: http://example.org/schemas/po.xsd#xmlns(p=http://example.com/schema/po)xscd(/type::p:USAddress). Here, xmlns(p=http://example.com/schema/po)xscd(/type::p:USAddress) represents an RSCD. The W3C SCD specification contains a more comprehensive set of examples[#7][#8][#9] that illustrate a number of possible uses and types of SCDs and SCPs.
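The two examples above can be taken apart mechanically. The following is a minimal sketch, not part of the proposed API, that splits an ASCD into the schema URI, the xmlns() bindings and the SCP. The class name AbsoluteScd and its fields are hypothetical, and the sketch assumes that the scheme data of each pointer part contains no parentheses, which holds for the examples above but is not guaranteed in general.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that splits an absolute SCD into its parts. */
final class AbsoluteScd {
    // One XPointer pointer part: xmlns(...) or xscd(...).
    // Assumption: the scheme data contains no parentheses.
    private static final Pattern POINTER_PART =
            Pattern.compile("(xmlns|xscd)\\(([^()]*)\\)");

    final String schemaUri;               // URI of the assembled schema, without fragment
    final Map<String, String> namespaces; // prefix -> namespace name, from xmlns() parts
    final String path;                    // schema component path from the xscd() part

    private AbsoluteScd(String schemaUri, Map<String, String> namespaces, String path) {
        this.schemaUri = schemaUri;
        this.namespaces = namespaces;
        this.path = path;
    }

    static AbsoluteScd parse(String ascd) {
        int hash = ascd.indexOf('#');
        if (hash < 0) {
            throw new IllegalArgumentException("No fragment identifier in: " + ascd);
        }
        String fragment = ascd.substring(hash + 1);
        Map<String, String> namespaces = new LinkedHashMap<>();
        String path = null;
        Matcher m = POINTER_PART.matcher(fragment);
        int pos = 0;
        // Consume the pointer parts one after another, left to right.
        while (pos < fragment.length()) {
            m.region(pos, fragment.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                        "Unexpected pointer part at offset " + pos + " in: " + fragment);
            }
            String data = m.group(2);
            if (m.group(1).equals("xmlns")) {
                int eq = data.indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException("Malformed xmlns() part: " + data);
                }
                namespaces.put(data.substring(0, eq).trim(), data.substring(eq + 1).trim());
            } else {
                path = data;
            }
            pos = m.end();
        }
        if (path == null) {
            throw new IllegalArgumentException("No xscd() part in: " + ascd);
        }
        return new AbsoluteScd(ascd.substring(0, hash), namespaces, path);
    }

    public static void main(String[] args) {
        AbsoluteScd scd = parse("http://example.org/schemas/po.xsd"
                + "#xmlns(p=http://example.com/schema/po)xscd(/type::p:USAddress)");
        System.out.println(scd.schemaUri);
        System.out.println(scd.namespaces);
        System.out.println(scd.path);
    }
}
```

Everything after the '#' in this sketch is exactly what an RSCD looks like, so the same loop would also serve for relative designators.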

Please note that the term assembled schema (or schema, or the schema description schema component) refers to a logical graph of schema components; such schemas may be physically represented as schema documents. In Xerces, the schema description schema component (i.e. the XML Schema object model) is represented by the XSModel[#10] interface, and the schema components are represented by the org.apache.xerces.xs interfaces.

In this project, I am focusing only on implementing RSCD support for Xerces because, according to the feedback I received from the Xerces community, it will often be more difficult and less useful to work with ASCDs given that there is no standard way to identify a schema by dereferencing a URI. The ability to resolve (i.e. to parse and evaluate) an RSCD comes from the ability to resolve a given SCP relative to a given context (i.e. relative either to a schema or to a schema component). Therefore, giving Xerces the ability to resolve SCPs (more specifically, non-canonical SCPs) is the main objective of this project, and RSCD support is implemented as a feature that uses it. There are several compelling reasons behind this.

  1. The SCP is the main component of any ASCD or RSCD (although only RSCDs are of interest here).
  2. SCPs have many uses; according to the W3C specification, SCPs can be used in contexts other than SCDs as long as proper namespace bindings are provided[#11]. For instance, an SCP could be used inside an XML element by properly binding namespaces.
  3. Another useful type of SCP is the incomplete SCP[#12]. An incomplete SCP can be evaluated against a given schema component to retrieve a set of schema components within it (i.e. similar to the way an RSCD is evaluated relative to a given schema, an incomplete SCP can be evaluated relative to a given schema component).


Therefore, it is highly desirable to come up with a loosely coupled design in which the SCP resolving capability is provided in a separate interface, both to serve potential future requirements and to improve overall extensibility and modularity. The following two primary operations reflect the RSCD implementation and would cover a number of SCD use cases[#13]:

  1. To resolve a relative SCD, i.e. given a schema and an RSCD as the inputs, return a list of schema components.
  2. To obtain the canonical SCP[#14] of a schema component (if available), i.e. given a schema component and the schema that contains the component, along with the necessary namespace bindings, as the inputs, return the canonical SCP.


Based on these two operations and the incomplete SCP resolving capability, I suggest the following essential operations for the SCP interface.

  1. XSObjectList resolveSCP(String scp, XSModel schema, NamespaceContext nc)
  2. XSObjectList resolveIncompleteSCP(String scp, XSObject component, NamespaceContext nc)

The following third and fourth methods are useful when resolving SCPs that do not involve a namespace binding. Such SCPs occur when the schema does not have a target namespace. For example, /type::purchaseOrderType/model::sequence/schemaElement::shipTo is an SCP that does not use a namespace binding.

  3. XSObjectList resolveSCP(String scp, XSModel schema)
  4. XSObjectList resolveIncompleteSCP(String scp, XSObject component)
  5. String getCanonicalSCP(XSObject component, XSModel schema, NamespaceContext nc)
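One way these five operations could be grouped is sketched below. This is only an illustration under several assumptions: the interface name ScpResolver and the constant NO_BINDINGS are hypothetical, the type parameters M and C stand in for XSModel and XSObject, and List<C> stands in for XSObjectList so that the sketch compiles without Xerces on the classpath. The sketch also uses javax.xml.namespace.NamespaceContext from the JDK, which may not be the NamespaceContext type the final API ends up using.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

/**
 * Hypothetical shape of the SCP interface.
 * M stands in for XSModel, C for XSObject, List<C> for XSObjectList.
 */
interface ScpResolver<M, C> {

    /**
     * A namespace context without any bindings (simplified: it does not
     * special-case the reserved "xml" and "xmlns" prefixes).
     */
    NamespaceContext NO_BINDINGS = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return Collections.<String>emptyList().iterator();
        }
    };

    // Methods 1 and 2: the namespace-aware operations.
    List<C> resolveSCP(String scp, M schema, NamespaceContext nc);

    List<C> resolveIncompleteSCP(String scp, C component, NamespaceContext nc);

    // Methods 3 and 4: convenience overloads for SCPs without namespace
    // bindings, expressed here by delegating with an empty context.
    default List<C> resolveSCP(String scp, M schema) {
        return resolveSCP(scp, schema, NO_BINDINGS);
    }

    default List<C> resolveIncompleteSCP(String scp, C component) {
        return resolveIncompleteSCP(scp, component, NO_BINDINGS);
    }

    // Method 5: canonical SCP of a component within a schema.
    String getCanonicalSCP(C component, M schema, NamespaceContext nc);
}
```

Delegating the third and fourth methods to the first two keeps a single code path for parsing and evaluation; whether that is the right choice for Xerces is a design question still to be settled with the community.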


After considering the time constraints on the project and the need to set realistic and measurable objectives, I will implement only the first four methods; if time permits, I will also consider implementing the fifth. However, I have not included any specific details about it in my project schedule.

The main components of the implementation are the SCP parser and the SCP evaluator which are going to be used extensively by the above methods. For example, in the first four methods, the parser parses either an SCP or an incomplete SCP and then this expression is processed by the evaluator to return a list of schema components in an XSObjectList.
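To make the evaluator's role concrete, here is a toy sketch of step-by-step evaluation. It does not use XSModel: the Component class is a hypothetical stand-in in which every component lists the components reachable from it, grouped by axis name, and the steps are passed in already parsed as {axis, name} pairs. The real evaluator would have to map each axis onto the corresponding XML Schema API accessors instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy evaluator over a hypothetical in-memory component graph (not XSModel). */
final class ToyScpEvaluator {

    /** Hypothetical stand-in for a schema component, or for the schema itself. */
    static final class Component {
        final String name;
        // Components reachable from this one, grouped by the axis leading to them.
        final Map<String, List<Component>> byAxis = new LinkedHashMap<>();

        Component(String name) {
            this.name = name;
        }

        Component add(String axis, Component child) {
            List<Component> list = byAxis.get(axis);
            if (list == null) {
                list = new ArrayList<>();
                byAxis.put(axis, list);
            }
            list.add(child);
            return this;
        }
    }

    /**
     * Applies each {axis, name} step to the current set of components.
     * Starting from the schema corresponds to resolving a complete SCP;
     * starting from a component corresponds to an incomplete SCP.
     */
    static List<Component> evaluate(Component context, String[][] steps) {
        List<Component> current = Collections.singletonList(context);
        for (String[] step : steps) {
            List<Component> next = new ArrayList<>();
            for (Component c : current) {
                List<Component> candidates = c.byAxis.get(step[0]);
                if (candidates == null) {
                    continue; // this component has nothing along the requested axis
                }
                for (Component candidate : candidates) {
                    if (candidate.name.equals(step[1])) {
                        next.add(candidate);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Component sequence = new Component("sequence")
                .add("schemaElement", new Component("shipTo"))
                .add("schemaElement", new Component("billTo"));
        Component type = new Component("purchaseOrderType").add("model", sequence);
        Component schema = new Component("(schema)").add("type", type);

        // Steps of /type::purchaseOrderType/model::sequence/schemaElement::shipTo
        List<Component> result = evaluate(schema, new String[][] {
                {"type", "purchaseOrderType"}, {"model", "sequence"}, {"schemaElement", "shipTo"}});
        System.out.println(result.get(0).name); // prints shipTo
    }
}
```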

At the initial stage, the parser and the evaluator will be implemented to support only the XML Schema 1.0 object model; thanks to the loosely coupled nature of the design, the system should be easy to extend to support the XML Schema 1.1 object model as well. I believe speed and efficiency are the two most critical factors and must be met to the highest possible degree, because the introduction of this new feature must not degrade the existing performance of Xerces under any circumstances. Initially, however, more attention will be given to designing a solid API and to coming up with a modular and extensible design, as I mentioned earlier.

The parser can be generated with a parser generator such as JavaCC. Writing the evaluator requires a good understanding of the XML Schema API[#15] and of how to navigate an XSModel. The W3C SCD specification defines the EBNF syntax for both SCD[#16] and SCP[#17], which can be used in the generation of the parser. However, it does not suggest any semantics for evaluating such expressions.
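Independently of the tool finally chosen, the output expected from the parser can be illustrated with a small hand-written sketch. It is not generated by JavaCC and covers only a simplified subset of the grammar, namely paths of the form /axis::name/axis::name with an optional namespace prefix on the name; predicates, abbreviated steps and any other construct of the EBNF are deliberately rejected. The class names SimpleScpParser and Step are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written sketch for a simplified subset of SCP syntax. */
final class SimpleScpParser {

    /** One parsed step: an axis and a (possibly prefixed) name test. */
    static final class Step {
        final String axis;      // e.g. "type", "model", "schemaElement"
        final String prefix;    // namespace prefix, or "" if the name is unprefixed
        final String localName; // local part of the name test

        Step(String axis, String prefix, String localName) {
            this.axis = axis;
            this.prefix = prefix;
            this.localName = localName;
        }

        @Override
        public String toString() {
            return axis + "::" + (prefix.isEmpty() ? "" : prefix + ":") + localName;
        }
    }

    static List<Step> parse(String scp) {
        if (scp == null || !scp.startsWith("/")) {
            throw new IllegalArgumentException("SCP must start with '/': " + scp);
        }
        List<Step> steps = new ArrayList<>();
        // Limit -1 keeps empty segments so that "//" is reported as an error below.
        for (String raw : scp.substring(1).split("/", -1)) {
            int sep = raw.indexOf("::");
            if (sep <= 0 || sep + 2 >= raw.length()) {
                throw new IllegalArgumentException("Unsupported or malformed step: '" + raw + "'");
            }
            String axis = raw.substring(0, sep);
            String nameTest = raw.substring(sep + 2);
            int colon = nameTest.indexOf(':');
            String prefix = colon < 0 ? "" : nameTest.substring(0, colon);
            String localName = nameTest.substring(colon + 1);
            if (localName.isEmpty()) {
                throw new IllegalArgumentException("Missing local name in step: '" + raw + "'");
            }
            steps.add(new Step(axis, prefix, localName));
        }
        return steps;
    }

    public static void main(String[] args) {
        for (Step step : parse("/type::purchaseOrderType/model::sequence/schemaElement::shipTo")) {
            System.out.println(step.axis + " -> " + step.localName);
        }
        System.out.println(parse("/type::p:USAddress").get(0).prefix); // prints p
    }
}
```

A generated parser would replace the string splitting with grammar rules taken from the specification, but it could hand the evaluator the same kind of step list.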

Deliverables

  1. Source code and necessary build files for the SCD parser and evaluator
  2. Required patches, if any
  3. A collection of tests that can be used to verify the functionality of the SCD parser and evaluator
  4. SCD API documentation

Things I have done so far

I checked out and built the Xerces trunk, tried out some samples and tests, and started to study the code, especially the coding standards and styles that have been used and the package structure. Because I have no prior experience with Java tools and practices such as annotations, packaging, unit tests and documentation generation, I also started to learn them. I looked at existing Xerces issues related to the XML Schema API and searched JIRA for issues related to SCD. I spent most of my time researching SCD, especially trying to understand the W3C SCD specification, learning the background knowledge in XML Schema and XSModel, and setting up measurable goals for the project.

Development Schedule

I am planning to learn most of the required programming skills while doing the development. Initially, however (i.e. during the community bonding period), I will learn advanced Java skills and the required knowledge of XML Schema and the XML Schema API, since they are essential to start designing the components and to begin coding. I will dedicate the complete four-month period from April until the end of August to this project, and I can work between thirty and forty hours per week.

Community Bonding Period: April 26 - May 24

Get to know the mentor and the community
Learning more about the required API and features
Preparing the development environment
Familiarizing myself with Xerces, XML Schema API and Java technologies etc.
Reading documentation about JavaCC
Start designing the system: this includes designing the required data structures and algorithms for the SCP parser and the evaluator, overall class hierarchy, and deciding where and how to implement methods of the API etc.

Interim Period: May 24 - July 12

Finalizing the API design
Dividing the development process into stages with the help of the mentor
Completing the SCP parser together with its unit tests
Begin coding the evaluator (I believe the development of the evaluator will take more time than that of the parser and therefore I have allocated more time for it)

July 12 - July 16

Submitting mid-term evaluations and continuing with the development of the evaluator

Interim Period: July 16 - August 9

Completing the evaluator and its unit tests
Completing the first four methods of the API by combining the completed parser and evaluator into the final system
Testing the evaluator with the parser
Start working on unit tests and documentation for the overall functionality of the system

August 9 - August 16

Refining code and unit tests, running the complete test suite, and improving documentation

August 20

Final evaluation deadline

August 30

Submitting required code to Google

Community Interaction

I have subscribed to both the Xerces users list and the development list, and I posted a couple of times when I came across difficulties in installing and using Xerces. I also used the development list to introduce my interest in doing SCD as a project. Even before that, I tried to communicate with last year's GSoC mentors of Xerces in order to introduce myself to them and to ask about possible projects for this year. Apart from that, I used the mailing list whenever possible to clarify my doubts by asking the experts, especially about various aspects of the W3C SCD specification, the expected results and possible design details of this project, and internals of Xerces such as XSModel and XSSerializer. This knowledge, together with the feedback that I received on my draft project proposal, was very useful to me in creating this final project proposal. In the future, I also expect to use the mailing lists to clarify issues I find, to receive suggestions and feedback on my work from experienced developers, and to get them involved in the design process of the project. I also expect to maintain excellent communication with my mentor via email and IM.

About me

Hi, I'm Ishan. I'm an undergraduate in the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka, and my interests are XML and web services. What I expect most from participating in a GSoC project is to be introduced to a large, well-known community like Apache and ultimately to become a committer on that project. I have a great passion for contributing to free software, and I believe this would be a great opportunity and an excellent starting point for that. With this project, I'm hoping to obtain a better understanding of the Xerces architecture by experimenting with its code base and, above all, to implement a brand new feature for it that has just reached W3C Candidate Recommendation status. At the same time, I'm hoping to improve my programming and communication skills and to learn more about XML, XML Schema, Java and similar technologies.

My experience in open source development: My first experience in open source development was writing a plugin for the Mozilla Firefox web browser, a visualization tool for navigating and managing tabs. Then I attempted to contribute to the KDevelop IDE by fixing a small bug in it, but I did not receive positive feedback because the KDevelop community considered the fix unwanted. Nevertheless, I learned a lot of skills related to open source development by being involved in that project, even if it was for a short time. I can code in C, C++, and Java. In addition, I'm familiar with Linux and various command line tools. I always use free and open source software in my academic and development work, and I encourage my colleagues to use free software alternatives whenever they can.

References and Resources

[1] XML Schema Part 1: Structures Second Edition: http://www.w3.org/TR/xmlschema-1/
[2] XML Schema Part 2: Datatypes Second Edition: http://www.w3.org/TR/xmlschema-2/
[3] W3C XML Schema Definition Language (XSD): Component Designators: http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/
[4] W3C News Archive: http://www.w3.org/News/2010#entry-8703
[5] Schema Component Designators: http://www.w3.org/TR/xmlschema-ref/#section-scds
[6] Schema Component Paths: http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/#section-path
[7] Extended Primer Example: http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/#section-primer-example
[8] Additional Examples: http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/#section-example-more
[9] Examples with component and elided-component Axes (Non-Normative): http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/#section-examples-abbreviations
[10] XSModel (XML Schema API): http://xerces.apache.org/xerces2-j/javadocs/xs/org/apache/xerces/xs/XSModel.html
[11] See Section 4.3.2 Namespaces: http://www.w3.org/TR/xmlschema-ref/#section-path-interpret
[12] See Section 4.3.1 Incomplete Schema Component Paths: http://www.w3.org/TR/xmlschema-ref/#section-path-interpret
[13] Use Cases: http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/#section-usecases
[14] Canonical Schema Component Paths: http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/#section-canonical-path
[15] XML Schema API: http://xerces.apache.org/xerces2-j/javadocs/xs/index.html
[16] Schema Component Designator Syntax: http://www.w3.org/TR/xmlschema-ref/#section-scd-syntax
[17] Schema Component Path Syntax: http://www.w3.org/TR/xmlschema-ref/#section-path-syntax
