Proposal Title

[XERCESJ-1429] [GSoC]: Asynchronous LSParser and parseWithContext

Student Name

Thiwanka Somasiri

Student E-mail

asthiwanka@gmail.com

Organization/Project

The Apache Software Foundation / Xerces2-J

Assigned Mentor

Michael Glavassevich

TimeZone

GMT + 5.30

Proposal Abstract:

Apache Xerces2-J is a high-performance, fully compliant XML parser written in Java for parsing, validating and manipulating XML documents. The goal of this project is to complete the implementation of the DOM Level 3 LSParser by addressing two areas that are not yet implemented according to the W3C recommendation: the asynchronous LSParser and parseWithContext.

Detailed Description:

The purpose of this project is to address two important features of the Document Object Model Load and Save W3C recommendation [1] that are not yet implemented in Apache Xerces-J. The tasks are as follows:

1. Implementation of Asynchronous version of LSParser

2. Implementation of parseWithContext() functionality

The high-level design of the project is discussed below.

1. Implementation of Asynchronous version of LSParser

LSParser is an interface to an object that can build or augment a DOM tree using various input sources. It has an API for parsing XML and building the corresponding DOM structure [2]. However, Xerces-J lacks an asynchronous version of LSParser, which returns from the parse method immediately and builds the DOM tree on another thread.

High level implementation details for the asynchronous mode for LSParser

According to the specification, an asynchronous LSParser must also implement the EventTarget interface [3]. In Xerces-J, the actual implementation of LSParser resides in the org.apache.xerces.parsers.DOMParserImpl class, so the asynchronous version of the LSParser can be implemented as an extension of DOMParserImpl. Implementing the EventTarget interface requires the three methods listed below.

1. void addEventListener(String type, EventListener listener, boolean useCapture)

This method allows event listeners to be registered on the event target. The W3C recommendation places a constraint on registration: if multiple identical EventListeners are registered on the same EventTarget with the same parameters, the duplicate instances are discarded. To enforce this we can keep the registrations in a suitable data structure (probably a HashSet in Java) maintained in DOMParserImpl. Another implementation aspect is that an EventListener added to an EventTarget while it is processing an event must not be triggered by the current dispatch.

2. void removeEventListener(String type, EventListener listener, boolean useCapture)

This method is the inverse of addEventListener(): it removes a previously registered event listener from the event target.

3. boolean dispatchEvent(Event event)

This method dispatches an event into the implementation's event model. The return value indicates whether any of the EventListeners that handled the event called the preventDefault() method: it is "false" if preventDefault() was called and "true" otherwise [8] [9].
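The three methods above can be sketched as follows. This is only a minimal illustration, not the Xerces-J implementation: ListenerRegistry, Registration and SketchEvent are hypothetical names, a LinkedHashSet stands in for the proposed HashSet so that listeners run in registration order, and capture/bubble phases are ignored. Only the org.w3c.dom.events interfaces are real API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.w3c.dom.events.Event;
import org.w3c.dom.events.EventListener;
import org.w3c.dom.events.EventTarget;

/** Hypothetical minimal event; a real one would extend Xerces' EventImpl. */
final class SketchEvent implements Event {
    private final String type;
    private boolean defaultPrevented;

    SketchEvent(String type) { this.type = type; }

    public String getType() { return type; }
    public EventTarget getTarget() { return null; }
    public EventTarget getCurrentTarget() { return null; }
    public short getEventPhase() { return Event.AT_TARGET; }
    public boolean getBubbles() { return false; }
    public boolean getCancelable() { return true; }
    public long getTimeStamp() { return 0L; }
    public void stopPropagation() { }
    public void preventDefault() { defaultPrevented = true; }
    public void initEvent(String t, boolean bubbles, boolean cancelable) { }
    boolean isDefaultPrevented() { return defaultPrevented; }
}

/** Hypothetical listener bookkeeping that an asynchronous parser could embed. */
final class ListenerRegistry implements EventTarget {
    /** One (type, listener, useCapture) triple; equal triples count as duplicates. */
    private static final class Registration {
        final String type;
        final EventListener listener;
        final boolean useCapture;

        Registration(String type, EventListener listener, boolean useCapture) {
            this.type = type;
            this.listener = listener;
            this.useCapture = useCapture;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Registration)) return false;
            Registration r = (Registration) o;
            return type.equals(r.type) && listener == r.listener && useCapture == r.useCapture;
        }

        @Override public int hashCode() {
            return 31 * (31 * type.hashCode() + System.identityHashCode(listener)) + (useCapture ? 1 : 0);
        }
    }

    private final Set<Registration> registrations = new LinkedHashSet<>();

    public synchronized void addEventListener(String type, EventListener listener, boolean useCapture) {
        if (type == null || listener == null) return;
        registrations.add(new Registration(type, listener, useCapture)); // the Set discards duplicates
    }

    public synchronized void removeEventListener(String type, EventListener listener, boolean useCapture) {
        if (type == null || listener == null) return;
        registrations.remove(new Registration(type, listener, useCapture));
    }

    public boolean dispatchEvent(Event event) {
        // Snapshot first, so listeners added during dispatch are not triggered this time.
        List<EventListener> snapshot = new ArrayList<>();
        synchronized (this) {
            for (Registration r : registrations) {
                if (r.type.equals(event.getType())) snapshot.add(r.listener);
            }
        }
        for (EventListener listener : snapshot) {
            listener.handleEvent(event);
        }
        // false if some listener called preventDefault(), true otherwise
        return !(event instanceof SketchEvent && ((SketchEvent) event).isDefaultPrevented());
    }
}
```

Taking a snapshot before invoking the listeners is one simple way to satisfy the "added while processing" rule; a copy-on-write collection would be another option.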

An asynchronous LSParser supports two such events:

1. Load

This event signals that the LSParser has finished loading the document.

2. Progress

The LSParser signals progress as data is parsed. The signaling process is completely implementation dependent since the specification does not define when the progress events should be dispatched.

How the main thread and the thread that the asynchronous parser runs on interact -

In an asynchronous LSParser, a call to parse() may return null immediately, because the Document may not have been constructed at that point. The parser continues parsing on its own thread, and when it finishes loading the Document an LSLoadEvent is dispatched to the registered EventListeners. The thread that invoked parse() (say, the main thread) can continue with other work while the asynchronous LSParser is busy building the Document on another thread. The synchronous parser does the opposite: it does not return until parsing has ended (the existing DOMParserImpl.java class implements this behaviour).
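The interaction between the two threads can be sketched with the synchronous LSParser that already ships with the JDK. AsyncParseSketch and its onLoad callback are hypothetical; the callback only stands in for dispatching an LSLoadEvent, and error handling is left out.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

/** Hypothetical wrapper that runs a synchronous LSParser on a worker thread. */
final class AsyncParseSketch {
    private final DOMImplementationLS ls;
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "async-lsparser");
        t.setDaemon(true); // do not keep the JVM alive for the sketch
        return t;
    });

    AsyncParseSketch() throws Exception {
        ls = (DOMImplementationLS) DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
    }

    /** Returns null at once; the Document is handed to onLoad from the worker thread. */
    Document parse(String xml, Consumer<Document> onLoad) {
        worker.execute(() -> {
            LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            onLoad.accept(parser.parse(input)); // stand-in for firing the load event
        });
        return null;
    }
}
```

The caller gets null back immediately and can carry on; the Document only becomes available inside the callback, which runs on the worker thread.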

How the ‘busy’ flag’s value should behave -

Although the asynchronous parser returns null immediately after the main thread (or any other thread) invokes parse(), the parser should remain in the busy state. Once it has finished loading the Document, it clears the "busy" flag, which gives any other thread waiting to invoke parse() on the asynchronous parser a chance to do so.

What the abort() method will do after invocation -

When abort() is called while the parser is busy, it should stop the document from being loaded. If the "busy" flag is 'true', I have to set it to false and interrupt the thread that runs the asynchronous parser. If the "busy" flag is 'false', abort() does nothing.
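The busy flag and abort() can be sketched as below. BusyFlagSketch and parseAsync are hypothetical names, and the parse job is an arbitrary Runnable rather than a real parser. One deliberate difference from the description above: here the worker clears the flag itself when it exits, so that a new parse cannot start while the interrupted thread is still unwinding. Rejecting a second parse with INVALID_STATE_ERR follows the LSParser specification.

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.w3c.dom.DOMException;

/** Hypothetical sketch of the busy flag and abort() for an asynchronous parser. */
final class BusyFlagSketch {
    private final AtomicBoolean busy = new AtomicBoolean(false);
    private volatile Thread worker;

    boolean getBusy() {
        return busy.get();
    }

    /** Starts the job on a new thread, or fails if a parse is already running. */
    void parseAsync(Runnable parseJob) {
        if (!busy.compareAndSet(false, true)) {
            throw new DOMException(DOMException.INVALID_STATE_ERR, "parser is busy");
        }
        Thread t = new Thread(() -> {
            try {
                parseJob.run();
            } finally {
                busy.set(false); // cleared on normal completion and after an abort
            }
        }, "async-lsparser");
        t.setDaemon(true);
        worker = t;
        t.start();
    }

    /** Interrupts the running parse; does nothing when the parser is idle. */
    void abort() {
        Thread t = worker;
        if (busy.get() && t != null) {
            t.interrupt();
        }
    }
}
```

Thread interruption is cooperative, so a real parse loop would have to check Thread.interrupted() (or a dedicated abort flag) between blocks of input for abort() to take effect promptly.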

The following diagrams depict an example for how progress events are dispatched.

http://farm6.static.flickr.com/5068/5594300706_da7d1ddd88_z.jpg

Figure 1 - Progress Event when parser starts receiving data

http://farm6.static.flickr.com/5228/5593713101_9c1ab28906_z.jpg

Figure 2 - Progress Event when parser processes blocks (2048 bytes) of data

The events mentioned above (see Figure 1 and Figure 2) can be implemented as LSLoadEvent and LSProgressEvent according to the specification. Since these classes are event oriented, they should reside in the org.apache.xerces.dom.events package and extend the base EventImpl class.
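One possible way to decide when to signal progress, following the 2048-byte blocks of Figure 2, is to wrap the input stream and report each time another block has been consumed. ProgressInputStream and its LongConsumer callback are hypothetical; in the parser the callback would create and dispatch an LSProgressEvent carrying the reported position.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.LongConsumer;

/** Hypothetical stream wrapper that reports progress once per 2048-byte block. */
final class ProgressInputStream extends FilterInputStream {
    static final int BLOCK_SIZE = 2048;

    private final LongConsumer onProgress;
    private long position;                 // bytes read so far
    private long nextReport = BLOCK_SIZE;  // position of the next block boundary

    ProgressInputStream(InputStream in, LongConsumer onProgress) {
        super(in);
        this.onProgress = onProgress;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) advance(1);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) advance(n);
        return n;
    }

    private void advance(int n) {
        position += n;
        // One report per block boundary crossed, even if a single read spans several blocks.
        while (position >= nextReport) {
            onProgress.accept(nextReport);
            nextReport += BLOCK_SIZE;
        }
    }
}
```

Bytes passed over with skip() are not counted in this sketch, and the block size is only the value used in the figure, since the specification leaves the timing of progress events to the implementation.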

This is how EventListeners are triggered by particular event:

http://farm6.static.flickr.com/5189/5593712795_cf3d28e53e_z.jpg

Figure 3

When multiple event listeners are listening for the same event (say E1), all of them should be triggered when E1 occurs. The diagram above (see Figure 3) shows how EventListeners are triggered by a particular event. When an EventListener is triggered, its handleEvent() method is invoked [4].

The return value of the parseURI() method depends on whether the LSParser is asynchronous. When the Document object may not yet have been constructed, the return value should be "null". This should be handled in the implementation in much the same way as for parse().

2. Implementation of parseWithContext() functionality

Implementation of parseWithContext() is the second part of the project, and it is a very important feature for XML application developers. At the moment Xerces-J does not support parsing a document fragment and attaching it to an existing DOM.

In order to insert a fragment into an existing document, the fragment must be identified by an LSInput. The parameters of the method (input, contextArg and action) define, respectively, the LSInput from which the source is read, the node used as the context for the data being parsed, and the action to take between the new set of nodes and the existing children of the context node [7]. The "input" parameter of parseWithContext() should be an XML fragment (anything except a complete XML document), a DOCTYPE, entity declaration(s), etc. There are, however, some special cases that we must handle. One such case is when the "input" is a whole XML document rather than an XML fragment, the context node is a Document node and the "action" is ACTION_REPLACE_CHILDREN. That input can be processed as a whole XML document, just as if it had been parsed using the regular LSParser.parse() method.
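How the "action" parameter could map onto ordinary DOM operations is sketched below. ContextActions and apply() are hypothetical names; the ACTION_* constants are the real ones from org.w3c.dom.ls.LSParser. The sketch assumes the new nodes have already been parsed as children of some root node and that the context node is not a Document node; for the insert and replace actions it also needs a parent.

```java
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.ls.LSParser;

/** Hypothetical helper that applies a parseWithContext() action to parsed nodes. */
final class ContextActions {
    /** Copies the children of parsedRoot into the context's document per the action. */
    static void apply(Node parsedRoot, Node context, short action) {
        Document owner = context.getOwnerDocument();
        DocumentFragment nodes = owner.createDocumentFragment();
        for (Node n = parsedRoot.getFirstChild(); n != null; n = n.getNextSibling()) {
            nodes.appendChild(owner.importNode(n, true)); // deep copy into the target document
        }
        Node parent = context.getParentNode();
        switch (action) {
            case LSParser.ACTION_APPEND_AS_CHILDREN:
                context.appendChild(nodes);
                break;
            case LSParser.ACTION_REPLACE_CHILDREN:
                while (context.hasChildNodes()) {
                    context.removeChild(context.getFirstChild());
                }
                context.appendChild(nodes);
                break;
            case LSParser.ACTION_INSERT_BEFORE:
                parent.insertBefore(nodes, context);
                break;
            case LSParser.ACTION_INSERT_AFTER:
                // a null reference node makes insertBefore append at the end
                parent.insertBefore(nodes, context.getNextSibling());
                break;
            case LSParser.ACTION_REPLACE:
                parent.replaceChild(nodes, context);
                break;
            default:
                throw new IllegalArgumentException("unknown action: " + action);
        }
    }
}
```

Using a DocumentFragment keeps each action to a single DOM call, because inserting a fragment inserts all of its children in order.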

Following a discussion with Michael Glavassevich on the mailing list [5], he suggested implementing the method (for the case where the context node is not a Document node) by synthesizing a wrapper XML document that contains a reference to the fragment together with the content (for example, namespace declarations) required to parse it. Afterwards, the nodes created for the fragment can be transferred into the existing DOM. The following demonstration and diagram describe the high-level approach (see Figure 4).

Consider the existing document :

<ns1:a xmlns:ns1="http://ns1">

<ns2:b xmlns:ns2="http://ns2"/>

</ns1:a>

Then we want to add the fragment below, as a child of “ns2:b” :

<ns2:c/><ns1:d/>

Then we can generate a wrapper document instead as follows :

<!DOCTYPE DUMMY_ROOT [

 <!ENTITY fragment PUBLIC "***" "***">

]>

<DUMMY_ROOT xmlns:ns1="http://ns1" xmlns:ns2="http://ns2">&fragment;</DUMMY_ROOT>

Here the “fragment” entity points to the XML fragment provided by the user, so we can parse this document as a normal XML document. We can then merge the new nodes underneath the entity reference into the existing document, resulting in:

<ns1:a xmlns:ns1="http://ns1">

<ns2:b xmlns:ns2="http://ns2"><ns2:c/><ns1:d/></ns2:b>

</ns1:a>

http://farm6.static.flickr.com/5024/5594301286_4104aba36f_z.jpg

Figure 4 - High level approach for parseWithContext()
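The wrapper synthesis step can be sketched as follows: walk up from the context node, collect the namespace declarations in scope, and write them onto DUMMY_ROOT. WrapperDocumentSketch and its methods are hypothetical, and so is the system identifier passed in; the sketch uses a SYSTEM entity for brevity, and resolving that identifier to the user's LSInput is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical sketch that builds the wrapper document text for a context node. */
final class WrapperDocumentSketch {
    /** Collects xmlns declarations from the context element and its ancestors. */
    static Map<String, String> inScopeNamespaces(Node context) {
        Map<String, String> declarations = new LinkedHashMap<>();
        for (Node n = context; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            NamedNodeMap attrs = n.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr attr = (Attr) attrs.item(i);
                String name = attr.getNodeName();
                if (name.equals("xmlns") || name.startsWith("xmlns:")) {
                    declarations.putIfAbsent(name, attr.getValue()); // nearest declaration wins
                }
            }
        }
        return declarations;
    }

    /** Builds the wrapper text; fragmentSystemId is a placeholder for the fragment's location. */
    static String buildWrapper(Node context, String fragmentSystemId) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE DUMMY_ROOT [\n");
        sb.append(" <!ENTITY fragment SYSTEM \"").append(escape(fragmentSystemId)).append("\">\n");
        sb.append("]>\n");
        sb.append("<DUMMY_ROOT");
        for (Map.Entry<String, String> e : inScopeNamespaces(context).entrySet()) {
            sb.append(' ').append(e.getKey()).append("=\"").append(escape(e.getValue())).append('"');
        }
        sb.append(">&fragment;</DUMMY_ROOT>");
        return sb.toString();
    }

    private static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

For the example document above with "ns2:b" as the context node, this yields a DUMMY_ROOT element that declares both ns2 and ns1, so the prefixes used in the fragment resolve when the wrapper is parsed.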

Deliverables

1. Source and build files for Asynchronous LSParser and parseWithContext()

2. Test cases to verify the required functionality of the project

3. Necessary documentation/APIs

Things done so far

1. Checked out and built the Apache Xerces-J trunk

2. Set up development environment

3. Went through the coding disciplines and styles

4. Went through the related classes and interfaces of the project (for example: DOMParserImpl, XMLFragmentScannerImpl, etc.)

5. Went through many XML and XML Schema tutorials to gain a better understanding of XML

6. Went through the DOM Load and Save W3C recommendation multiple times to achieve a good understanding of the project

Development Schedule

Before I start coding, I will strengthen the areas I need to be strong in to proceed with the project. I will go through XML and XML Schema, which will be advantageous in the second part of the project (parseWithContext()). I will also attempt to come up with a more stable architecture in discussion with the mentor. Over the upcoming four months I look forward to enjoying the coding and hope to spend an average of at least 35 hours per week on the project.

April 26 – May 22

May 23 – July 10

July 11 – July 15

July 16 – Aug 14

August 15 – August 21

August 22 - Begin submitting final evaluations to Google

August 30 - Submit required code samples to Google

Community Interaction

From the moment I first thought of joining a GSoC project, I was looking for opportunities in Apache Xerces-J. I went through the list of projects that were not covered in earlier GSoCs for Xerces-J, then subscribed to the mailing list and asked the community whether I could take on this project. I soon got a reply from Michael Glavassevich (Project Lead) saying that this project was still available for 2011. I then checked out and built the Apache Xerces-J trunk and went through the coding disciplines of the project to get a basic idea of how a massive project works. Later I started a discussion with Michael about the project under the topic “New to Apcahe Xerces”, continued it over a period of two months, and gained a good overall understanding of the project [6].

About Me

I am Thiwanka Somasiri, a final year undergraduate in the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka. I have a few years of coding experience in the Java and C programming languages. During my internship I used XML parsing in one of my projects (developed using J2SE) to provide a customizable feature for the convenience of Quality Assurance people. I also have some open source software development experience, having developed an add-on for Mozilla Firefox for bulk image downloading using JavaScript and XUL.

As my final year group project, I am implementing a language-independent web personalization framework for e-commerce applications, which analyzes user navigation, user activity history, etc. using concepts such as Data Mining and Machine Learning. To achieve this independence we are using XML in that project, and I hope that experience will be very useful when working on this project.

It is my pleasure to have an opportunity to contribute to a massive organization like Apache, and I would like to continue working with Apache Xerces-J and to become a committer in the near future. The Xerces-J community is very supportive of newcomers, and I would always encourage beginners to join this fantastic community and work together.

References

[1]. http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html

[2]. http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSParser

[3]. http://www.w3.org/TR/DOM-Level-2-Events/events.html#Events-EventTarget

[4]. http://www.w3.org/TR/DOM-Level-2-Events/events.html#Events-EventListener

[5]. http://markmail.org/thread/x4adifemzae5comi#query:+page:1+mid:ir6z3e3ryjvhoa4w+state:results

[6]. http://markmail.org/thread/x4adifemzae5comi

[7]. http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSInput

[8]. http://www.w3.org/TR/DOM-Level-2-Events/events.html#Events-Registration-interfaces

[9]. http://www.w3.org/TR/DOM-Level-2-Events/events.html#Events-EventListener

ThiwankaSomasiri/gsoc_2011_XERCESJ-1429_proposal (last edited 2011-04-09 16:59:21 by Michael Glavassevich)