Writing Custom Transformers

If you need any kind of custom processing before a row is sent to Solr, you can write a transformer of your own. Let us take an example use case. Suppose you have a single-valued field named "artistName" in your schema, of type="string", which you want to facet on, so no index-time analysis should be done on this field. The value can contain multiple words, like "Celine Dion", but there is a problem: your data contains extra leading and trailing whitespace that you want to remove. The WhitespaceAnalyzer in Solr can't be applied, since you don't want to split the data into multiple tokens. One solution is to write a TrimTransformer.

A Simple TrimTransformer

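package foo;

import java.util.Map;

public class TrimTransformer {
        public Object transformRow(Map<String, Object> row) {
                // Read the column declared in data-config.xml
                Object artist = row.get("artistName");
                if (artist != null) {
                        // Write the trimmed value back to the same column
                        row.put("artistName", artist.toString().trim());
                }
                return row;
        }
}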

There is no need to extend any class. Any class that has a method named transformRow with the above signature will do: DataImportHandler instantiates it and calls its transformRow method using reflection.

You may, of course, also extend the abstract class org.apache.solr.handler.dataimport.Transformer. Either way, the class is registered on the entity in data-config.xml:

<entity name="artist" query="..." transformer="foo.TrimTransformer">
        <field column="artistName" />
</entity>
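The reflective call described above can be pictured with a minimal, self-contained sketch. It is not DataImportHandler's actual code: ExampleTrimTransformer and ReflectiveInvoker are hypothetical names, the package declaration is omitted for brevity, and the sketch only assumes that the class named in the transformer attribute has a public no-argument constructor and a public transformRow(Map) method.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical stand-in for a user-written transformer with no base class.
class ExampleTrimTransformer {
    public Object transformRow(Map<String, Object> row) {
        Object artist = row.get("artistName");
        if (artist != null) {
            row.put("artistName", artist.toString().trim());
        }
        return row;
    }
}

// Hypothetical helper; it only illustrates the idea and is not DataImportHandler's code.
class ReflectiveInvoker {
    static Object invoke(String className, Map<String, Object> row) {
        try {
            // Load the class by the name given in the transformer attribute
            Class<?> clazz = Class.forName(className);
            Object transformer = clazz.getDeclaredConstructor().newInstance();
            // Look up transformRow(Map) by name and parameter type, then call it
            Method method = clazz.getMethod("transformRow", Map.class);
            return method.invoke(transformer, row);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot invoke transformRow on " + className, e);
        }
    }
}
```

Calling ReflectiveInvoker.invoke with the class name and a row whose "artistName" is "  Celine Dion  " returns the same row with the value trimmed to "Celine Dion".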


What about returning values like null or an empty List?

Let TS = {t_0, t_1, t_2, ..., t_n} be the ordered list of transformers configured for a particular entity, and suppose t_i, with 0 <= i <= n, returns null for a particular row (or for every row produced by an earlier transformer that split the original row into several). There are two cases. If no transformer t_k, with k < i, returned a java.util.List of rows, then that row is ignored: it won't be inserted, deleted or updated, and no transformer t_j, with j > i, will be invoked for it. If some transformer t_k, with k < i, did return a java.util.List of rows for that row, and t_i returns null for all of those newly created rows, then a java.lang.IndexOutOfBoundsException is thrown.

In addition, if a transformRow call ever returns an empty java.util.List, a java.lang.IndexOutOfBoundsException is thrown. So you should never return an empty java.util.List.
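As an illustration of these rules, here is a minimal sketch of a transformer that may return several rows but never an empty List. SplitArtistTransformer and the ';' separator are hypothetical and chosen only for this example. The sketch also assumes this is the first transformer in the chain to return a List, so that returning null simply causes the row to be skipped, as described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical transformer: emits one row per ';'-separated artist name.
class SplitArtistTransformer {
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get("artistName");
        if (value == null) {
            return row; // nothing to split: pass the row through unchanged
        }
        List<Map<String, Object>> rows = new ArrayList<>();
        for (String name : value.toString().split(";")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                // Copy the other columns and replace only the artist name
                Map<String, Object> copy = new HashMap<>(row);
                copy.put("artistName", trimmed);
                rows.add(copy);
            }
        }
        // Never hand back an empty List; return null so the row is skipped instead.
        return rows.isEmpty() ? null : rows;
    }
}
```

If this transformer ran after another one that had already returned a List, returning null for every row would run into the second case above, so the guard would need to be reconsidered for such a chain.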


A General TrimTransformer

Suppose you want to write a general TrimTransformer without hardcoding the column on which it operates. For that, we need a flag on the field in data-config.xml to indicate that the TrimTransformer should be applied to that field.

<entity name="artist" query="..." transformer="foo.TrimTransformer">
        <field column="artistName" trim="true" />
</entity>

Now you'll need to extend the Transformer abstract class and use the API methods in Context to get the list of fields in the entity and get attributes of the fields to detect if the flag is set.

package foo;

import java.util.List;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataImporter;
import org.apache.solr.handler.dataimport.Transformer;

public class TrimTransformer extends Transformer {

        @Override
        public Map<String, Object> transformRow(Map<String, Object> row, Context context) {
                List<Map<String, String>> fields = context.getAllEntityFields();

                for (Map<String, String> field : fields) {
                        // Check if this field has trim="true" specified in the data-config.xml
                        String trim = field.get("trim");
                        if ("true".equals(trim)) {
                                // Apply trim on this field
                                String columnName = field.get(DataImporter.COLUMN);
                                // Get this field's value from the current row
                                Object value = row.get(columnName);
                                // Trim and put the updated value back in the current row
                                if (value != null)
                                        row.put(columnName, value.toString().trim());
                        }
                }

                return row;
        }

}

If the field is multi-valued, then the value in the row is a List instead of a single object and needs to be handled accordingly; the code above would call toString() on the whole List.
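One possible way to handle the multi-valued case is sketched below. TrimValues and trimValue are hypothetical names, not part of DataImportHandler, and the sketch assumes that every element of a multi-valued field can be treated as text via toString().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: trims a single value or each element of a List of values.
class TrimValues {
    static Object trimValue(Object value) {
        if (value instanceof List) {
            // Multi-valued field: build a new List with each element trimmed
            List<Object> trimmed = new ArrayList<>();
            for (Object item : (List<?>) value) {
                trimmed.add(item == null ? null : item.toString().trim());
            }
            return trimmed;
        }
        // Single-valued field: trim the text form of the value
        return value == null ? null : value.toString().trim();
    }
}
```

Inside transformRow, the call row.put(columnName, value.toString().trim()) could then be replaced with row.put(columnName, TrimValues.trimValue(value)).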

Adding DataImportHandler dependencies

If you are using the Transformer and Context abstract classes, you will need to add the jar for DataImportHandler to your project as a dependency. This can be done by specifying a Class-Path in your Manifest file.

If you export your jar to the shared instance directory (solr/lib), then your Manifest file may look something like this:

Manifest-Version: 1.0
Class-Path: ../../../dist/apache-solr-dataimporthandler.jar

DIHCustomTransformer (last edited 2012-10-02 10:29:40 by 2-232-11-16)