Differences between revisions 14 and 15
Revision 14 as of 2006-05-08 09:13:44
Size: 15679
Editor: gateway
Comment: FIXED external link
Revision 15 as of 2009-09-20 23:35:42
Size: 15725
Editor: localhost
Comment: converted to 1.6 markup

SimonWillnauer/SummerOfCode2006

Subject

GData Server

Author

Simon Willnauer, Berlin, Germany

Abstract

The Google Data API (GDATA) is a new protocol based on Atom 1.0 and RSS 2.0. GDATA combines the features of these XML-based syndication formats with a feed-publishing system based on the Atom Publishing Protocol. The goal of the Google Data API is to provide a manageable web service including a versioning system, query handling and authentication following the REST approach.
The GDATA API offers client-side libraries for accessing GDATA services via the HTTP protocol. The purpose of this project is to provide a server-side GDATA service implementation that integrates the Apache Lucene text search engine library, licensed under the Apache License.

GDATA (http://code.google.com/apis/gdata/overview.html), ATOM (http://www.ietf.org/html.charters/atompub-charter.html), RSS (http://feedvalidator.org/docs/rss2.html), REST (http://www.xml.com/lpt/a/2004/12/01/restful-web.html), Lucene (http://lucene.apache.org/), Apache License (http://www.apache.org/licenses/)

Project Overview

Syndication has met its objective of providing and aggregating content that is universally accessible in an independent way. GDATA aims to extend common XML-based syndication formats by integrating update, insert, delete, version, query and authentication functionality into an HTTP-based web service. This project intends to provide an implementation of a GDATA-compliant service hosted in a common servlet container.

As XML-based formats are highly extensible, with new specifications coming up periodically, it is quite straightforward to provide an API that includes the base functionality of the service specifications and can be extended in the future.
By interacting with the GDATA client library, the service offers an interface to access feed content in a highly flexible way. As described in the REST approach, a GDATA service can be accessed by the HTTP methods PUT, GET, POST and DELETE. These methods correspond to the UPDATE, RETRIEVE, CREATE and DELETE actions of the service, which are observed by the version component. Besides this, only authorized users may alter content. Retrieving content does not always mean getting the entire contents of the requested resource, and this is where Lucene comes into play. By indexing incoming entries, queried content can be retrieved in a very efficient way. The Lucene search engine library is able to index content individually to support all standard query parameters and any custom parameter by indexing defined parts of the content in specific fields, which makes it an excellent choice of search component for the GDATA service.
Development will be based on the Lucene 1.9 final release and Java 5.0 technology.

JAVA (http://java.sun.com)

Project Description

The project will be divided into several milestones that can be developed fairly independently of each other. To ensure quantifiable results for the mentor organization, the milestones represent an escalation strategy for delivering a working result even if the project cannot be finished in time. In my work I have found escalation strategies to be very important, especially in software projects.

Milestone 1 – Implementing the CRUD – Actions

Provide CRUD-Actions

The first development step is to implement the basic functionality of a REST architecture. The system must be able to retrieve, create, update and delete entries from feed documents. The documents and entries will be organized and persisted in a data structure on the local file system, comparable to a flat-file database. The HTTP interface, represented by a service servlet, accepts PUT, DELETE, GET and POST requests. Each request will be validated and processed, and the service response, including the corresponding response code and the requested data, will be sent back to the client.
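
The dispatch step in the service servlet could look roughly like the following. This is only a minimal sketch using plain Java without the Servlet API; the names CrudAction, fromHttpMethod and modifiesContent are hypothetical and not part of any GDATA library.

```java
import java.util.Locale;

/** CRUD actions named as in the proposal text. */
enum CrudAction {
    CREATE, RETRIEVE, UPDATE, DELETE;

    /**
     * Maps an HTTP method name to the action it triggers.
     * Throws IllegalArgumentException for methods the service does not handle.
     */
    static CrudAction fromHttpMethod(String method) {
        switch (method.toUpperCase(Locale.ROOT)) {
            case "POST":   return CREATE;
            case "GET":    return RETRIEVE;
            case "PUT":    return UPDATE;
            case "DELETE": return DELETE;
            default:
                throw new IllegalArgumentException("Unsupported HTTP method: " + method);
        }
    }

    /** Only RETRIEVE leaves the feed unchanged; the other actions alter content. */
    boolean modifiesContent() {
        return this != RETRIEVE;
    }
}
```

A servlet could call fromHttpMethod with the request method and use modifiesContent to decide whether the request has to pass the authorization check first.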

Organization of the persistent data

The structure of the persistent files will be kept as simple as possible so that it remains as easily readable by humans as the XML itself. Data will be stored as described above on the local file system, organized in a folder structure. To create a new feed it will be sufficient to create a new folder structure including an empty GDATA format file without any entries.
This is quite the easiest way to persist data and will be sufficient at this stage of the project. I have seen a couple of good ideas for organizing XML data, for instance in an XML database such as Apache Xindice. To keep this implementation platform-independent and get the most out of Lucene, it would be a good idea to use an approach similar to the Nutch Distributed File System (NDFS) and keep the XML data inside a separate Lucene index. With this approach there would be several Lucene indexes, one for searching and one for retrieving data, since a single index would otherwise grow rapidly.
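
Creating a feed as a folder with an empty feed file could be sketched as follows. The file name feed.xml, the class FeedFolders and the restriction on feed ids are assumptions made for illustration; the skeleton written here is a placeholder and not a complete Atom feed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FeedFolders {
    // Hypothetical file name; the proposal does not name the feed file.
    static final String FEED_FILE = "feed.xml";

    /** Creates root/feedId/feed.xml holding a feed skeleton without entries. */
    static Path createFeed(Path root, String feedId) {
        // Reject ids that could escape the storage root (assumed naming rule).
        if (!feedId.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Invalid feed id: " + feedId);
        }
        String emptyFeed =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n"
          + "  <id>" + feedId + "</id>\n"
          + "</feed>\n";
        try {
            Path dir = Files.createDirectories(root.resolve(feedId));
            Path file = dir.resolve(FEED_FILE);
            // CREATE_NEW fails if the feed already exists instead of overwriting it.
            Files.write(file, emptyFeed.getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE_NEW);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```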

NDFS (http://wiki.apache.org/nutch/NutchDistributedFileSystem)

Result format

Feeds and entries will be persisted in the incoming GDATA document format and transformed, possibly by XSLT, into the requested format, using ATOM as the default. XSLT could provide a very flexible way to extend the output formats of the GDATA service in the future.
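
A transformation step of this kind could be sketched with the XSLT support that ships with the JDK (javax.xml.transform). The class name ResultFormatter is hypothetical, and a real service would probably cache compiled stylesheets instead of compiling one per request.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ResultFormatter {
    /** Applies the given XSLT stylesheet to the stored XML and returns the result. */
    static String transform(String storedXml, String stylesheet) {
        try {
            // Compile the stylesheet, then run it over the stored document.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(storedXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

One stylesheet per output format would then be enough to add a new format without touching the storage code.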

GDATA document format (http://code.google.com/apis/gdata/protocol.html#Document-format), XSLT (http://www.w3.org/TR/xslt)

Milestone 2 – Integrating the authentication component

Authorization component

Authentication, like security in general, is always a very important part of distributing data. Operations which update, delete or create entries must only be accessible to authorized users. Sometimes retrieving information from a service is also restricted because the feed document is private. The GDATA specification describes the authentication mechanism in Google Account Authentication. In this milestone the authentication component will be integrated into the application. Like any other functionality in the system, authentication will be available via an HTTP/HTTPS interface represented by a servlet. The servlet will authenticate the user credentials against a realm database. If authentication succeeds, the authentication system returns a token that the client subsequently uses (in an HTTP Authorization header) when it sends GData requests. If authentication fails, the server returns a 403 Forbidden status code, along with a WWW-Authenticate header containing a challenge applicable to the authentication.
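
The token handling could be sketched as below. This is an in-memory illustration only: the class TokenAuthenticator is hypothetical, the realm is a plain map instead of a database, passwords are compared in plain text, and the header prefix "GoogleLogin auth=" is my reading of the Google Account Authentication documentation and should be treated as an assumption.

```java
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TokenAuthenticator {
    // Assumed Authorization header scheme; verify against the Google docs.
    private static final String PREFIX = "GoogleLogin auth=";

    private final Map<String, String> realm;   // user -> password (sketch only)
    private final Map<String, String> tokens = new ConcurrentHashMap<String, String>(); // token -> user
    private final SecureRandom random = new SecureRandom();

    TokenAuthenticator(Map<String, String> realm) {
        this.realm = realm;
    }

    /** Returns a fresh token, or null if the credentials are rejected (caller answers 403). */
    String login(String user, String password) {
        String expected = realm.get(user);
        if (expected == null || !expected.equals(password)) {
            return null;
        }
        byte[] raw = new byte[16];
        random.nextBytes(raw);
        StringBuilder token = new StringBuilder();
        for (byte b : raw) {
            token.append(String.format("%02x", b));   // hex-encode the random bytes
        }
        tokens.put(token.toString(), user);
        return token.toString();
    }

    /** Resolves an Authorization header value to a user name, or null if the token is unknown. */
    String userFor(String authorizationHeader) {
        if (authorizationHeader == null || !authorizationHeader.startsWith(PREFIX)) {
            return null;
        }
        return tokens.get(authorizationHeader.substring(PREFIX.length()));
    }
}
```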

Google Account Authentication (http://code.google.com/apis/accounts/Authentication.html)

Configure Authorization

To put CRUD actions under authorization in a highly flexible way, an authorization servlet filter could be used and mapped in the deployment descriptor of the application. This enables the service provider to restrict any resource independently.
Whether authentication is offered via HTTP or HTTPS is up to the service provider, who configures a connector for the authentication interface; the application itself is not affected.

Servlet-Filter (http://tomcat.apache.org/tomcat-5.5-doc/servletapi/index.html)

Milestone 3 – Create the query component / integrating Lucene

Index – field configuration

Since Lucene is able to index various kinds of documents and the GDATA format can be extended using foreign namespaces, it makes sense to make index fields and custom query parameters configurable. For this purpose a configuration file will be available in the classpath, to be read on application start-up.
For indexing, the XML document must be parsed with an XML library such as dom4j to build an object representation of the document. Each element of the XML can be selected for indexing by an XPath expression in the configuration file. The field type can also be defined using keywords such as “stored,” “text” or “keyword.” Following this approach, field boosts can be set in the configuration file as well.

It may happen that special elements such as dates need to be indexed and made searchable as well. For these special cases the configuration file can be extended with attributes that specify the Java type and possibly a parser instance.
The following snippet shows an example of a configuration element:

<field type="keyword" fieldname="author" boostfactor="1.2">
    <path>/entry/author/name</path>
</field>
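
Reading such a configuration element and applying its path to an entry could be sketched as follows. The sketch uses the DOM and XPath classes from the JDK instead of dom4j, the class FieldConfig is hypothetical, and the parser is left non-namespace-aware so that the unprefixed path can be used as written; a real implementation would have to deal with the Atom namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class FieldConfig {
    final String type;
    final String fieldName;
    final float boost;
    final String path;

    private FieldConfig(String type, String fieldName, float boost, String path) {
        this.type = type;
        this.fieldName = fieldName;
        this.boost = boost;
        this.path = path;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Reads one <field> element; a missing boostfactor defaults to 1.0. */
    static FieldConfig fromXml(String fieldXml) {
        Element field = parse(fieldXml).getDocumentElement();
        String path = field.getElementsByTagName("path").item(0).getTextContent().trim();
        String boost = field.getAttribute("boostfactor");
        return new FieldConfig(field.getAttribute("type"), field.getAttribute("fieldname"),
                               boost.isEmpty() ? 1.0f : Float.parseFloat(boost), path);
    }

    /** Evaluates the configured XPath against an entry and returns the text value. */
    String extract(String entryXml) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(path, parse(entryXml));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + path, e);
        }
    }
}
```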

DOM4J (http://www.dom4j.org)

Indexing entries

When an entry is created in or updated on an existing feed, the entry will be forwarded to the Lucene indexer component. Indexing will run concurrently with the GDATA request to prevent the request from taking too much time.
The entry will be transformed into an object representation and the content will be retrieved by the indexer component. The indexer tries to create each field of the document according to the index field configuration. If one of the configured fields cannot be accessed due to an XPath error or a similar exception, the indexer component will write a message to the indexer log and continue indexing the entry without the field that caused the error.
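
The behaviour described here, indexing off the request thread and skipping a field that fails, could be sketched like this. Lucene itself is left out: the "document" is just a map of field names to values, and EntryIndexer and FieldExtractor are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.logging.Level;
import java.util.logging.Logger;

final class EntryIndexer {
    /** Hypothetical stand-in for one configured field: pulls its value out of an entry. */
    interface FieldExtractor {
        String extract(String entryXml) throws Exception;
    }

    private static final Logger LOG = Logger.getLogger("indexer");
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Map<String, FieldExtractor> fields;

    EntryIndexer(Map<String, FieldExtractor> fields) {
        this.fields = fields;
    }

    /** Queues the entry and returns immediately, so the request thread need not wait. */
    Future<Map<String, String>> submit(final String entryXml) {
        Callable<Map<String, String>> task = () -> buildDocument(entryXml);
        return worker.submit(task);
    }

    /** Builds the field map, logging and skipping any field that fails. */
    Map<String, String> buildDocument(String entryXml) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        for (Map.Entry<String, FieldExtractor> field : fields.entrySet()) {
            try {
                doc.put(field.getKey(), field.getValue().extract(entryXml));
            } catch (Exception e) {
                LOG.log(Level.WARNING, "Skipping field " + field.getKey(), e);
            }
        }
        return doc;
    }

    /** Lets queued entries finish, then stops the worker thread. */
    void shutdown() {
        worker.shutdown();
    }
}
```

A single worker thread keeps index writes serialized, which is one simple way to avoid concurrent writers on the same index.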

X-path (http://www.w3.org/TR/xpath)

Parsing Queries

Incoming queries will be translated into Lucene query syntax and passed to the index searcher. If the GDATA queries are extended, it would be worth considering a type-2 (context-free) grammar to parse the query in future development. For the current complexity of the queries, a simple Interpreter pattern [GOF] would be satisfactory.
The searcher checks whether the parameters specified in the query are valid and runs the query against the Lucene index. If a parameter is invalid, e.g. because it is not recognized, the service will respond with a 400 Bad Request response code, as in the following example:

HTTP/1.1 400 Invalid parameter: 'title'
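
The translation and the rejection of unknown parameters could be sketched as below. Only the q, author and category parameters are handled, the Lucene field names are assumed to equal the parameter names, and the full-text value is passed through in parentheses without interpreting its syntax; QueryTranslator and BadRequestException are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class QueryTranslator {
    /** Signals a 400 response; the message doubles as the reason phrase. */
    static final class BadRequestException extends RuntimeException {
        BadRequestException(String message) {
            super(message);
        }
    }

    // Parameters mapped one-to-one onto index fields (assumed field names).
    private static final Set<String> FIELD_PARAMS =
        new HashSet<String>(Arrays.asList("author", "category"));

    /** Builds a Lucene query string; all clauses must match (AND). */
    static String toLuceneQuery(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> param : params.entrySet()) {
            String clause;
            if (param.getKey().equals("q")) {
                // Full-text part, grouped so it combines cleanly with other clauses.
                clause = "(" + param.getValue() + ")";
            } else if (FIELD_PARAMS.contains(param.getKey())) {
                // Quote the value as a phrase and escape backslashes and quotes.
                String value = param.getValue().replace("\\", "\\\\").replace("\"", "\\\"");
                clause = param.getKey() + ":\"" + value + "\"";
            } else {
                throw new BadRequestException("Invalid parameter: '" + param.getKey() + "'");
            }
            if (query.length() > 0) {
                query.append(" AND ");
            }
            query.append(clause);
        }
        return query.toString();
    }
}
```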

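Lucene query syntax (http://lucene.apache.org/java/docs/queryparsersyntax.html), Interpreter - Pattern (http://www.dofactory.com/Patterns/PatternInterpreter.aspx)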

GOF -- Gang of Four - Design Patterns: Elements of Reusable Object-Oriented Software

Milestone 4 – Implement the Version-System

Versions

The different versions of an entry can be managed in several ways, for example with incremental versions as in CVS, or by simply keeping both the old and the new version. To keep this part simple in the first development cycle, I would keep each version of an entry in its entirety, which would simplify the retrieval of versions considerably. The GDATA protocol description doesn’t say much about retrieving or keeping older versions, so I plan to keep the versions for later use.

Optimistic concurrency

To ensure that multiple clients don't inadvertently overwrite one another's changes, the .version file is introduced. The .version file contains the current version of the feed instance. If an update or create request comes in, the system locks a global object, which represents an operation or the file itself, and processes the update/create. If another request arrives in the meantime, the system reads the version property of the locked object and compares it with the version number carried by the request; if the two differ, it returns a 409 Conflict response code to the client.
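
The version check could be sketched as follows. The version is kept in memory here rather than in the .version file, and the class VersionedFeed and its method names are hypothetical.

```java
final class VersionedFeed {
    static final int OK = 200;
    static final int CONFLICT = 409;

    private final Object lock = new Object();
    private int version = 1;        // would be read from the .version file
    private String content = "";

    /** Applies the update only if the client edited the current version. */
    int update(int clientVersion, String newContent) {
        synchronized (lock) {
            if (clientVersion != version) {
                return CONFLICT;    // another client updated the feed first
            }
            content = newContent;
            version++;
            return OK;
        }
    }

    int currentVersion() {
        synchronized (lock) {
            return version;
        }
    }

    String content() {
        synchronized (lock) {
            return content;
        }
    }
}
```

A client that receives the conflict code is expected to fetch the current version, reapply its change and send the update again.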

Optimistic concurrency (http://code.google.com/apis/gdata/protocol.html#Optimistic-concurrency)

Milestone 5 – Integration-tests and final documentation

The last milestone will be to run integration tests on how the different components interact. There are always some parts which need to be improved and/or fixed at this stage of development. Some performance testing using a profiler may be needed.
The Javadoc documentation will be written during development, so the final documentation will cover deployment, configuration and indexing topics.

Schedule

  • 09 June Milestone 1
  • 23 June Milestone 2
  • 14 July Milestone 3
  • 30 July Milestone 4
  • 22 August Milestone 5

Quality of deliverables

To guarantee appropriate code quality it is essential to achieve high unit test coverage for the entire code base. This will be done during development, following the test-driven development approach.
The Java code will be documented using Javadoc.

Future consideration

These considerations are not formal requirements of this proposal, but sidetracks that could play a role in future development. Writing them down makes them part of the thinking behind the current proposal without making them direct goals of the project described above.

JMX integration

The Java Management Extensions (JMX) provide a very flexible and efficient tool for building distributed, web-based and modular solutions for managing applications. Indexing, authentication and feed management could easily be integrated by using the commons-modeler library.

Caching

Retrieving the data could lead to a lot of disk I/O, which might hurt the performance of the application. The most common request will be for a complete feed instance, which could easily be cached by a cache implementation such as OSCache.
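
To illustrate the idea without OSCache, a small least-recently-used cache for complete feed documents can be built on the JDK's LinkedHashMap. The class FeedCache is hypothetical and the sketch is not thread-safe on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Keeps the most recently requested feeds in memory, keyed by feed id. */
final class FeedCache extends LinkedHashMap<String, String> {
    private final int capacity;

    FeedCache(int capacity) {
        super(16, 0.75f, true);   // true = order entries by access, not insertion
        this.capacity = capacity;
    }

    /** Called by LinkedHashMap after each insert; evicts the least recently used feed. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }
}
```

Whatever cache is used, a cached feed has to be dropped or refreshed whenever one of its entries is created, updated or deleted.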

About Me

Bio

I am a Computer Science major of the University of Applied Science of Berlin and have been working as a software developer in a professional environment since October 2004.

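During my time as a professional developer I have worked on several projects integrating Lucene in complex J2EE applications, most notably an “Application Service Provider” web engine based on Lucene and Larm.
My areas of interest include information retrieval and text analysis, and I have been looking for a way to participate in an ASF open source project for a while.
This is quite simply the best project I could imagine participating in. Projects I’ve worked on:

  • http://www.bfai.de (persistent layer / user management / Lucene)
  • Init Search - http://www.init.de/service/suche/index.html (Search / Webservice)
  • E-Paper (http://www.init.de/service/infobox/index.html) (complete development)
  • ASP-SEARCH (http://www.init.de/service/suche/index.html)
  • http://www.jugendopposition.de (CMS/backend)

Unfortunately, all of these project sheets and websites have little or no English translation.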

LARM (http://larm.sf.net/)

Development methodology

  • Test First Development
  • Using UML for modeling
  • Building project using Ant
  • CVS/SVN is a must
  • Building prototypes (to realize critical parts at an early stage)

Being the best individual

On the Google Summer of Code website, students are asked to state “…and the reason you're the best individual to do so”. I thought about this for a while. I have been using open source software for some time and expect high-quality products, especially from Apache projects. All of these developers contribute their skills to the community, and in my opinion it is simply time to give something back.
In other words, I’m grateful to have this opportunity and feel that I am well suited for this project.

SimonWillnauer/SummerOfCode2006 (last edited 2009-09-20 23:35:42 by localhost)