Differences between revisions 7 and 8
Revision 7 as of 2007-03-19 19:59:59
Size: 11291
Editor: MamtaSatoor
Comment:
Revision 8 as of 2009-09-20 22:11:53
Size: 11303
Editor: localhost
Comment: converted to 1.6 markup
Deletions are marked like this. Additions are marked like this.
Line 2: Line 2:
[[TableOfContents(2)]]


This page describes how to implement a function to perform language based ordering. It is largely copied from the email suggestion by Daniel Debrunner [http://www.nabble.com/Language-based-ordering-is-possible-in-Derby-tf482505.html#a1313448 Language based ordering is possible in Derby] but incorporates suggestions from Oyvind to add language and country parameters and adds a function to sort by the JVM default locale.
<<TableOfContents(2)>>


This page describes how to implement a function to perform language based ordering. It is largely copied from the email suggestion by Daniel Debrunner [[http://www.nabble.com/Language-based-ordering-is-possible-in-Derby-tf482505.html#a1313448|Language based ordering is possible in Derby]] but incorporates suggestions from Oyvind to add language and country parameters and adds a function to sort by the JVM default locale.
Line 9: Line 9:
By default Derby performs Unicode character comparison for ordering. This is not suitable for many languages. [https://issues.apache.org/jira/browse/DERBY-1478 DERBY-1478] has been filed for built-in language based ordering for the long term. Work is being done on DERBY-1478 as part Derby 10.3 and more info on it can be found at ["BuiltInLanguageBasedOrderingDERBY-1478"]. A Derby language based ordering function may be a usable workaround for some applications until that issue has been resolved. By default Derby performs Unicode character comparison for ordering. This is not suitable for many languages. [[https://issues.apache.org/jira/browse/DERBY-1478|DERBY-1478]] has been filed for built-in language based ordering for the long term. Work is being done on DERBY-1478 as part Derby 10.3 and more info on it can be found at [[BuiltInLanguageBasedOrderingDERBY-1478]]. A Derby language based ordering function may be a usable workaround for some applications until that issue has been resolved.
Line 13: Line 13:
In order to solve the problem of language based ordering we will define Derby functions which use the built in Java [http://java.sun.com/j2se/1.5.0/docs/api/java/text/Collator.html java.text.Collator] class. These can be used in an SQL Statement for language based ordering. One issue with this is that no indexes are used for functions so there are performance implications with large data sets. Below is a summary of the operations and a possible replacement strategy: In order to solve the problem of language based ordering we will define Derby functions which use the built in Java [[http://java.sun.com/j2se/1.5.0/docs/api/java/text/Collator.html|java.text.Collator]] class. These can be used in an SQL Statement for language based ordering. One issue with this is that no indexes are used for functions so there are performance implications with large data sets. Below is a summary of the operations and a possible replacement strategy:
Line 19: Line 19:
|| GROUP BY || GROUP BY expression is not available in Derby so there is no current replacement. [https://issues.apache.org/jira/browse/DERBY-883 DERBY-883] is in progress ||
|| LIKE || There is not a built in java matching function to use as a replacement for String matching. A LIKE equivalent using CollationElementInterator should be possible but would be complex. See [http://www.nabble.com/Language-based-matching-tf1908121.html#a5271152 local based matching] discussion||
|| GROUP BY || GROUP BY expression is not available in Derby so there is no current replacement. [[https://issues.apache.org/jira/browse/DERBY-883|DERBY-883]] is in progress ||
|| LIKE || There is not a built in java matching function to use as a replacement for String matching. A LIKE equivalent using CollationElementInterator should be possible but would be complex. See [[http://www.nabble.com/Language-based-matching-tf1908121.html#a5271152|local based matching]] discussion||
Line 52: Line 52:
One general issue with this approach is that indexes won't be used. [http://issues.apache.org/jira/browse/DERBY-455 DERBY-455] would help here. One general issue with this approach is that indexes won't be used. [[http://issues.apache.org/jira/browse/DERBY-455|DERBY-455]] would help here.
Line 311: Line 311:
["DerbySQLroutines"]

["BuiltInLanguageBasedOrderingDERBY-1478"]
[[DerbySQLroutines]]

[[BuiltInLanguageBasedOrderingDERBY-1478]]

Table of Contents

This page describes how to implement a function to perform language based ordering. It is largely copied from the email suggestion by Daniel Debrunner Language based ordering is possible in Derby but incorporates suggestions from Oyvind to add language and country parameters and adds a function to sort by the JVM default locale.

Why do we need a Language based ordering function?

By default Derby performs Unicode character comparison for ordering. This is not suitable for many languages. DERBY-1478 has been filed for built-in language based ordering for the long term. Work is being done on DERBY-1478 as part Derby 10.3 and more info on it can be found at BuiltInLanguageBasedOrderingDERBY-1478. A Derby language based ordering function may be a usable workaround for some applications until that issue has been resolved.

The Start of a Solution

In order to solve the problem of language based ordering we will define Derby functions which use the built in Java java.text.Collator class. These can be used in an SQL Statement for language based ordering. One issue with this is that no indexes are used for functions so there are performance implications with large data sets. Below is a summary of the operations and a possible replacement strategy:

SQL OPERATION

Replacement Strategy

ORDER BY

Use ORDER BY expression functionality with LOCALE_ORDER function based on Collator.getCollationKey(String value)

<, >, BETWEEN

Use LOCALE_COMPARE function based on Collator.compare(String source, String target)

IN, =

It is not clear if IN or = are affected, since these are an exact match is the default Derby processing ok?

GROUP BY

GROUP BY expression is not available in Derby so there is no current replacement. DERBY-883 is in progress

LIKE

There is not a built in java matching function to use as a replacement for String matching. A LIKE equivalent using CollationElementInterator should be possible but would be complex. See local based matching discussion

Below is an example for ORDER BY. It is not a complete solution but a place to start. Please fix up the example here on the Wiki as you make it better and implement the other functions.

The trick is to write an user defined function that takes a string and returns a value that is correct for ordering on. Say we have a function LOCALE_ORDER that takes language, country and value, then you can write:

SELECT ID, NAME FROM CUSTOMER ORDER BY LOCALE_ORDER('pl','PL',name)

Java provides all the functionality required, so it's really just a few lines of Java code.

The code uses java.text.Collator to obtain a byte array that represents the collation key of a string and then Derby sorts on that byte array, as a VARCHAR FOR BIT DATA.

It needs some more investigation to ensure Derby orders the returned byte arrays correctly, particularly if one byte array is larger than the other, when up to that point they have the same contents. Also more checking is needed to see if the Derby ordering matches the expected language collation.

Another issue is the length of the returned byte array. We would need to see how long it can be and what would happen it it returned a value longer than the declared type of the SQL function.

An alternative would be to change the Java method to return a String representation of the byte array and sort on that.

One general issue with this approach is that indexes won't be used. DERBY-455 would help here.

Example SQL Functions

There are two example functions for lanuguage based ordering.

"LOCALE_ORDER" takes a language, a country and a value parameter and returns a collation key. The locale can be specified in the query.

Syntax:

LOCALE_ORDER(LanguageCode, CountryCode, value)

LanguageCode - lowercase two-letter ISO-639 code CountryCode - uppercase two-letter ISO-3166 code value - value to order on, typically a column name

Example:

SELECT ID, NAME FROM CUSTOMER ORDER BY LOCALE_ORDER('pl','PL',name)

JVM_DEFAULT_LOCALE_ORDER takes only a value and will sort based on the JVM default locale.

Syntax:

JVM_DEFAULT_LOCALE_ORDER(value)

value - value to order on, typically a column name

Example:

SELECT ID, NAME FROM CUSTOMER ORDER BY JVM_DEFAULT_LOCALE_ORDER(name)

To create the functions run the script below in ij

CONNECT 'jdbc:derby:testdb;create=true';

CREATE FUNCTION LOCALE_ORDER(
LANGUAGE_CODE VARCHAR(2),
COUNTRY_CODE VARCHAR(2),
VALUE VARCHAR(128))
RETURNS VARCHAR(255) FOR BIT DATA
LANGUAGE JAVA
PARAMETER STYLE JAVA
NO SQL
RETURNS NULL ON NULL INPUT
EXTERNAL NAME 'derbytest.LocaleOrder.localeOrder';

CREATE FUNCTION JVM_DEFAULT_LOCALE_ORDER(
VALUE VARCHAR(128))
RETURNS VARCHAR(255) FOR BIT DATA
LANGUAGE JAVA
PARAMETER STYLE JAVA
NO SQL
RETURNS NULL ON NULL INPUT
EXTERNAL NAME 'derbytest.LocaleOrder.defaultLocaleOrder';

Java code for function implementation

package derbytest;

import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class LocaleOrder {


    /**
     *  Function that can be used by Derby for language based ordering.
     *  Uses java.text.Collator to obtain a byte array that represents
     *  the collation key of a string and then Derby sorts on that byte array.
     *
     * RESOLVE: These Comments from Dan who submitted the original code/idea
     *          to the Derby list
     *  - What will happen if one byte array is larger than the other,
     *    when up to that point they have the same
     *  - Test that Derby ordering matches the expected language collation.
     *  - An alternative is to return a String representation of the byte
     *   array and sort on that.
     *
     * @param language lowercase two-letter ISO-639 code.
     * @param country uppercase two-letter ISO-3166 code.
     * @param value Value for which collationKey is needed for ordering
     * @return collationKey for value
     *
     */
    public static byte[] localeOrder(String language, String country,
                                String value)
        {

                Locale locale  = new Locale(language, country);
                Collator collator = Collator.getInstance(locale);

                CollationKey key = collator.getCollationKey(value);

                byte[] rawData = key.toByteArray();

                return rawData;
        }

    /**
     * Function used for order by the JVM default Locale
     *
     * @param value Value for which collation key is needed for ordering
     * @return collation key for value based on default Locale
     * @see #localeOrder(String language, String country, String value)
     */
    public static byte[] defaultLocaleOrder(String value)
    {
        Locale locale  = Locale.getDefault();
        return localeOrder(locale.getLanguage(), locale.getCountry(), value);
    }
}

Test Code

package derbytest;

import java.sql.*;
import java.util.Locale;


import org.apache.derby.jdbc.EmbeddedDriver;

public class Test {

        private static final String[] NAMES =
        {
                // Just Smith, Zebra, Acorn with alternate A,S and Z
                "Smith",
                "Zebra",
                "\u0104corn",
                "\u017Bebra",
                "Acorn",
                "\u015Amith",
                "aacorn",
        };

        public static void main(String[] args) throws SQLException
        {
                new EmbeddedDriver();
                Connection conn = DriverManager.getConnection("jdbc:derby:testdb");

                Statement s = conn.createStatement();

                try {
                s.execute("CREATE TABLE CUSTOMER(ID INT, NAME VARCHAR(40))");

                conn.setAutoCommit(false);
                PreparedStatement ps = conn.prepareStatement("INSERT INTO CUSTOMER VALUES(?,?)");

                for (int i = 0; i < NAMES.length; i++)
                {
                        ps.setInt(1, i);
                        ps.setString(2, NAMES[i]);
                        ps.executeUpdate();
                }
                conn.commit();
                ps.close();
                } catch (SQLException sqle)
                {

                }

                System.out.println("DEFAULT Unicode character set ordering");
                query(s, "SELECT ID, NAME FROM CUSTOMER ORDER BY NAME");

                System.out.println("Polish ordering");
                query(s, "SELECT ID, NAME FROM CUSTOMER ORDER BY LOCALE_ORDER('pl','PL',NAME)");

                System.out.println("JVM Default Locale ordering for wherever you are:" +
                                Locale.getDefault());
                query(s, "SELECT ID, NAME FROM CUSTOMER ORDER BY JVM_DEFAULT_LOCALE_ORDER(NAME)");

                System.out.println("Norwegian ordering by forcing the Norwegian default locale");
                Locale.setDefault(new Locale("no","NO"));

                query(s, "SELECT ID, NAME FROM CUSTOMER ORDER BY JVM_DEFAULT_LOCALE_ORDER(NAME)");
        }

        private static void query(Statement s, String sql) throws SQLException
        {
                System.out.println(sql);
                ResultSet rs = s.executeQuery(sql);
                while (rs.next())
                {
                        System.out.print("  ");
                        System.out.print(rs.getString(1));
                        System.out.print(" , ");
                        System.out.println(rs.getString(2));
                }
                rs.close();
        }
}

Test Output

The test output shows the various language based orderings. For example aacorn sorts to the bottom for Norwegian.

$ java derbytest.Test
DEFAULT Unicode character set ordering
SELECT ID, NAME FROM CUSTOMER ORDER BY NAME
  4 , Acorn
  0 , Smith
  1 , Zebra
  6 , aacorn
  2 , ?corn
  5 , ?mith
  3 , ?ebra
Polish ordering
SELECT ID, NAME FROM CUSTOMER ORDER BY LOCALE_ORDER('pl','PL',NAME)
  6 , aacorn
  4 , Acorn
  2 , ?corn
  0 , Smith
  5 , ?mith
  1 , Zebra
  3 , ?ebra
JVM Default Locale ordering for wherever you are:en_US
SELECT ID, NAME FROM CUSTOMER ORDER BY JVM_DEFAULT_LOCALE_ORDER(NAME)
  6 , aacorn
  4 , Acorn
  2 , ?corn
  0 , Smith
  5 , ?mith
  1 , Zebra
  3 , ?ebra
Norwegian ordering by forcing the Norwegian default locale
SELECT ID, NAME FROM CUSTOMER ORDER BY JVM_DEFAULT_LOCALE_ORDER(NAME)
  4 , Acorn
  2 , ?corn
  0 , Smith
  5 , ?mith
  1 , Zebra
  3 , ?ebra
  6 , aacorn

DerbySQLroutines

BuiltInLanguageBasedOrderingDERBY-1478

LanguageBasedOrdering (last edited 2009-09-20 22:11:53 by localhost)