Differences between revisions 20 and 21
Revision 20 as of 2014-06-13 04:36:44
Size: 7208
Editor: ShalinMangar
Comment: Added question on tinyint being converted to boolean
Revision 21 as of 2014-06-13 04:41:29
Size: 7208
Editor: ShalinMangar
Comment: Fixed typo
Deletions are marked like this. Additions are marked like this.
Line 103: Line 103:
=== Why does DIH convery my tinyint(1) columns to boolean values in Solr? === === Why does DIH convert my tinyint(1) columns to boolean values in Solr? ===

I'm using DataImportHandler with a MySQL database. My table is huge and DataImportHandler is going out of memory. Why does DataImportHandler bring everything to memory?

DataImportHandler is designed to stream row one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize which some drivers do not honor. For MySQL, add batchSize property to dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables.

Should look like:

<dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:8889/mysqldatabase" batchSize="-1" user="root" password="root"/>

I'm using DataImportHandler with MS SQL Server database with sqljdbc driver. DataImportHandler is going out of memory. I tried adjustng the batchSize values but they don't seem to make any difference. How do I fix this?

There's a connection property called responseBuffering in the sqljdbc driver whose default value is "full" which causes the entire result set to be fetched. See http://msdn.microsoft.com/en-us/library/ms378988.aspx for more details. You can set this property to "adaptive" to keep the driver from getting everything into memory. Connection properties like this can be set as an attribute (responseBuffering="adaptive") in the dataSource configuration OR directly in the jdbc url specified in DataImportHandler's dataSource configuration.

Note that this is not an issue since version 1.2 of the SQL Server JDBC driver. Since version 1.2, "adaptive" is the default setting for responseBuffering.

Is it possible to use core properties inside data-config xml?

If you define the property inside of your <core/> in solr.xml then you can refer to the property in your data-config.xml file directly. However if you define it in the <solr/> level, so that it applies to multiple cores, then you need to use the technique below:

  • Add your property in the invariant section of solrconfig's DataImportHandler element. For example, add this section:

  • <lst name="invariants"> <str name="xmlDataDir">${xmlDataDir}</str> </lst>

  • Use it as ${dataimporter.request.xmlDataDir} in your data-config to access this.
  • see the mail thread

How would I insert a static value into a field ?

add the TemplateTransformer to the transformer attribute of the entity and create a field as follows

<field column="the_static_field_name" template="any-string-can-be-put-here"/>

Why is Transformer specified on entity instead of field ?

A few reasons

  • A field declaration is optional because it can be implicit if it is present in the schema
  • A Transformer can act on multiple fields . Or it can even do nothing to the data. Say a LogTransformer just logs data to a file

  • Transfomers can be chained . If declared in a field the order of chaining becomes unclear

My delta-import goes out of memory . Any workaround ?

It is possible to do delta import using full-import. Taken from the delta-import example

<dataConfig>
    <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
    <document>
          <entity name="findDelta" query="select id from item where id in
                                (select item_id as id from feature where last_modified > '${dataimporter.last_index_time}')
                                or id in 
                                (select item_id as id from item_category where item_id in 
                                    (select id as item_id from category where last_modified > '${dataimporter.last_index_time}')
                                or last_modified > '${dataimporter.last_index_time}')
                                or last_modified > '${dataimporter.last_index_time}'" rootEntity="false">
            <entity name="item" query="select * from item where id='${findDelta.id}'>
            <entity name="feature" query="select description as features from feature where item_id='${item.ID}'">                
            </entity>
            <entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">                    
                </entity>
            </entity>
        </entity>
      </entity>
    </document>
</dataConfig>

notice that for the entity 'findDelta' rootEntity is set to false, so a document is created for each row from 'item'. The command has to be

command=full-import&clean=false

How do I use DataImportHandler with multiple Solr Cores?

If you are using multiple Solr indexes (or multiple cores) in the same Solr webapp, each core will run a separate instance of DataImportHandler and each core should have its own data-config.xml file in the conf directory. See MultipleIndexes for more details on using multiple indexes.

Running Out of Memory with Postgresql

<!> Solr1.4 No more memory issues with this data source configuration:

<dataSource type="JdbcDataSource" driver="org.postgresql.Driver" url="jdbc:postgresql://host/db" user="user" password="password" readOnly="true" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED" holdability="CLOSE_CURSORS_AT_COMMIT"/>

Using v8.3 of postgres server and JDBC driver.

How do I use a JNDI DataSource?

<!> Solr1.4 Setup your JNDI datasource and specify its name in the data-config.xml

 <dataSource
      jndiName="java:comp/env/jdbc/myDataSource"
      type="JdbcDataSource"
      user="" password=""/>

Blob values in my table are added to the Solr document as object strings like B@1f23c5

The problem occurs because blobs are read as a byte array which does not have a useful toString method. The workaround is to:

  1. Use convertType="true" on the JdbcDataSource

  2. Write a Transformer which converts the byte[] into a string type
  3. You can also use a "cast" sql function which can convert the data type from blob to strings if your database provides such a function

Note that "convertType" attribute uses the target schema field's type name to convert the value returned by the result set. Therefore, any transformers being used on the entity should be aware of the type information.

Invalid dates (e.g. "0000-00-00") in my MySQL database cause my import to abort

Use &zeroDateTimeBehavior=convertToNull parameter in you JDBC connection string to avoid this.

Why does DIH convert my tinyint(1) columns to boolean values in Solr?

This is an oddity of the JDBC specification where tinyint(1) values are treated as bits and their corresponding Java type is a boolean. Use tinyInt1isBit=false parameter in your JDBC connection string to avoid the conversion to boolean value.

DataImportHandlerFaq (last edited 2014-06-13 04:41:29 by ShalinMangar)