If you follow the installation instructions, Trafodion should start up correctly. However, since each system has a somewhat different setup, you may encounter problems during or after an install. This page talks about how Trafodion objects are set up, what CLASSPATHs and objects are required, and related topics. It also lists common installation problems and their solutions. If you need to solve such issues or if these topics interest you, please read on.

Zookeeper Configuration for Max Client Connections

To configure the maximum number of client connections in ZooKeeper on a Hortonworks cluster, do the following:

  1. Log on to the Ambari GUI
  2. Click ZooKeeper in the left-hand pane
  3. Select the "Configs" tab at the top
  4. Scroll to the bottom and click "Custom zoo.cfg"
  5. Select "Add Property..." and enter the property and value (see the example below)
  6. Click "Add"
  7. Restart ZooKeeper from Ambari
  8. Stop Trafodion (sqstop from a terminal)
  9. Start Trafodion (sqstart from a terminal)
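
The standard ZooKeeper property for this limit is maxClientCnxns. A minimal sketch of the entry to add in Custom zoo.cfg (the value of 300 is only an example; choose a limit that fits your workload):

# maximum number of concurrent connections a single client, identified by
# IP address, may make to one ZooKeeper server; 0 means unlimited
maxClientCnxns=300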

(Screenshot: AmbariAddProperty.jpg, the Ambari "Add Property" dialog)

Note: On a Cloudera cluster, the installer performs these actions automatically. (This will soon be true for Hortonworks as well.)

 

HBase Configuration for Trafodion

Before even looking at the Trafodion executables, you should make sure that HBase is set up correctly to work with Trafodion. Trafodion introduces new code to HBase in the form of a jar containing HBase-trx. This jar is copied into the HBase lib directory during installation; in some cases you may need to copy it there manually. Some things to consider:

  • The HBase-trx jar has a name that matches the pattern hbase-trx-*.jar. This jar must exist in the HBase lib directory, which is usually /usr/lib/hbase/lib.
  • In Trafodion Release 1.1.n, the required HBase release is 0.98.6 (for CDH 5.3) or 0.98.4 (for HDP 2.2).
    • For the Cloudera CDH 5.3 distribution, the name of the HBase-trx jar is hbase-trx-1.1.n.jar.
    • For the Hortonworks HDP 2.2 distribution, the name of the HBase-trx jar is hbase-trx-hdp2_2-1.1.n.jar.
  • In Trafodion Release 1.0.n, the required HBase release is 0.98.1 (for CDH 5.1) or 0.98.0 (for HDP 2.1).
    • For the Cloudera CDH 5.1 distribution, the name of the HBase-trx jar is hbase-trx-1.0.n.jar.
    • For the Hortonworks HDP 2.1 distribution, the name of the HBase-trx jar is hbase-trx-hdp2_1-1.0.n.jar.
  • Make sure there is exactly one such hbase-trx jar in the directory and that it is of the correct version.
  • Make sure the HBase-trx jar has the correct ownership and permissions (the permissions should be 644).
  • Make sure that the HBase-trx jar exists on all the nodes in the cluster that host the HBase Master or a RegionServer. You can use the pdsh and pdcp commands for this; see the sketch after this list for more information.
  • Make sure HBase is restarted after adding the HBase-trx jar.
  • Java 7 is the only supported Java version. The minimum Java 7 version required is 1.7.0_67.
  • Check the RegionServer logs for any error messages related to HBase-trx. A typical example: ClassNotFoundException: TransactionalRegionServer.
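
As a quick check of the jar placement across the cluster, here is a sketch using the pdsh convention described later on this page (the path assumes the usual /usr/lib/hbase/lib location):

# list the hbase-trx jar on every node; each node should report exactly one
# matching file with permissions 644 (the quotes keep the wildcard from
# being expanded locally)
pdsh $MY_NODES "ls -l /usr/lib/hbase/lib/hbase-trx-*.jar"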

The Trafodion installer also adds several properties to the HBase configuration file, hbase-site.xml. For example:

<property>
     <name>hbase.hregion.impl</name>
     <value>org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion</value>
</property>
<property>
    <name>hbase.coprocessor.region.classes</name>
    <value>org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionObserver,
           org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint,
           org.apache.hadoop.hbase.coprocessor.AggregateImplementation
    </value> 
</property>

Make sure you see an entry for hbase.hregion.impl, pointing to org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion, and a coprocessor entry that includes org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionObserver. Those are just two examples. If they are missing, something went wrong with the installation.

For Trafodion SQL aggregation, the org.apache.hadoop.hbase.coprocessor.AggregateImplementation coprocessor is used for non-transactional (non-HBase-trx) queries. Aggregation routines in the org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint coprocessor are used for HBase-trx transactional queries.

Once you believe HBase is set up correctly, validate this with the hbase shell, using some simple commands like the following:

sudo su - trafodion
hbase shell
  create 'testtab','cf1'
  put 'testtab','row1','cf1:col1','val1'
  scan 'testtab'
  disable 'testtab'
  drop 'testtab'
  quit

 

Trafodion User ID

Trafodion runs under a user ID, trafodion, that has been created on your system by the Trafodion installer. The trafodion user ID needs to have passwordless ssh access to all nodes in the cluster. The installer should have created public and private keys in ~trafodion/.ssh to set this up. During setup, you should be able to su (or sudo su) to this user ID. Once things are set up, you probably won't want to use this ID anymore. It should definitely have a strong password, or, better, no password at all.

The installer adds commands to ~trafodion/.bashrc to set up some environment variables. Most of this setup is done through files $MY_SQROOT/sqenv.sh and $MY_SQROOT/sqenvcom.sh. MY_SQROOT is set up in .bashrc. Make some basic checks:

  • $MY_SQROOT needs to point to the install directory of Trafodion.
  • $DCS_INSTALL_DIR needs to point to the install directory of DCS (ODBC/JDBC connectivity services for Trafodion).
  • $MY_NODES should expand to a list of Trafodion nodes in the cluster. It has the form -w <node1> ... -w <noden>. This is used with pdcp and pdsh commands.
  • The trafodion user ID is able to ssh to all the nodes listed in $MY_NODES, without supplying a password.
  • $CLASSPATH is set up. See the next section.

A quick way to verify these settings:
sudo su - trafodion
echo $MY_SQROOT
echo $DCS_INSTALL_DIR
pdsh $MY_NODES pwd
echo $CLASSPATH

Trafodion Class Path

The class path is probably the main place where issues with Trafodion installation surface. It's a long CLASSPATH, consisting of the following parts:

  1. HBase jars
  2. Hadoop jars
  3. Hive jars
  4. Hadoop configuration directory
  5. HBase configuration directory
  6. Hive configuration directory
  7. HBase-trx jar (another copy in the Trafodion install dir)
  8. Trafodion jars (UDR, HBase access, JDBC Type 2 driver)

If you see elements of this CLASSPATH missing (often indicated by consecutive colons in the middle of the CLASSPATH), take a look at the sqenvcom.sh file and how it sets up the CLASSPATH. Often there is a simple fix to the problem. Note that the configuration directories are required to access HBase as well as to get to Hive metadata.
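
As a quick way to spot such gaps, here is a hedged one-liner (run it as the trafodion user):

# print each CLASSPATH element on its own line; any line numbers reported
# by grep correspond to empty elements, i.e. consecutive colons
echo $CLASSPATH | tr ':' '\n' | grep -n '^$'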

Typical issues with the class path:

  • HBase or Hadoop jars are missing: The $MY_SQROOT/sqenvcom.sh file tries to detect what Hadoop distribution you are running. If this goes wrong, the CLASSPATH may be missing many files.
  • Hive jars may be missing: This may be because Hive is not installed, and that should be acceptable. HBase access should still work. If you install Hive later, that's OK, but you will need to do another sqgen. See below for what that means.

Fixing class path issues may require a few extra steps. Let's say you fixed the problem by making a small patch to the $MY_SQROOT/sqenvcom.sh file, or by copying some jar, say /somedir/newjar.jar, into place. Since Trafodion runs distributed on the cluster, another step is required to propagate the patched environment to all nodes. There is a utility, called sqgen, that creates the files needed to set up the CLASSPATH and other information and copies some generated files across the cluster. Here is an example of what you might do as the trafodion user. Please modify these commands as applicable to you:

sudo su - trafodion

# stop Trafodion
sqstop

cd $MY_SQROOT
# patch environment, for example in this file
vi sqenvcom.sh

# now use the pdcp command to copy the patched file to all nodes
# The $MY_NODES variable is handy to do this.
pdcp $MY_NODES sqenvcom.sh $PWD

# test the patch on all nodes by checking the classpath
# (single quotes make $CLASSPATH expand on each remote node, not locally;
# the resulting count should equal the number of nodes in $MY_NODES)
pdsh $MY_NODES 'echo $CLASSPATH' | grep /usr/lib/hbase/hbase-trx | wc -l

# copy other needed files
pdcp $MY_NODES /somedir/newjar.jar /somedir

# secure the new file
pdsh $MY_NODES chmod 644 /somedir/newjar.jar

# validate that the HBase-trx jars are in the right place (see HBase setup topic above)
# note that we put the name with wildcards in quotes, so it is resolved on the target system
pdsh $MY_NODES ls -l "/usr/lib/hbase/lib/*trx*"

# use a new shell (important) to generate a new environment for Trafodion
ssh localhost
sqgen
exit

# exit all shells with the old environment and restart Trafodion
exit
sudo su - trafodion
sqstart

Security

Trafodion uses a Java interface to access HBase, and the native libhdfs library to access the HDFS files of Hive tables. If you have set permissions for HBase and HDFS, they must permit the trafodion user ID to access the necessary HBase tables and HDFS files. The easiest way to test this is to run HBase, Hive, and HDFS commands as the trafodion user. See above for a simple test that can be done in the HBase shell.
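
For HDFS, a similarly simple smoke test might look like this (the warehouse path shown is the Hive default and is only an assumption; adjust it to your configuration):

sudo su - trafodion
# confirm the trafodion user can list Hive warehouse files in HDFS
hdfs dfs -ls /user/hive/warehouse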

Processes and Network

Again, because Trafodion manages a cluster, there are a few special things to know and check. Trafodion has its own set of processes. As the trafodion user, you can use the sqps command to display all the Trafodion processes running on the cluster. Example (simplified):

sudo su - trafodion
$ sqps
...
[$Z000LSB] NID,PID(os)  PRI TYPE STATES  NAME        PARENT      PROGRAM
[$Z000LSB] ------------ --- ---- ------- ----------- ----------- ---------------
[$Z000LSB] 000,00022519 001 DTM  ES--A-- $TM0        NONE        tm             
[$Z000LSB] 000,00024392 001 GEN  ES--A-- $Z000JWX    NONE        mxosrvr        
[$Z000LSB] 000,00024441 001 GEN  ES--A-- $Z000JYB    NONE        mxosrvr        
[$Z000LSB] 000,00005348 001 GEN  ES--A-- $Z0004CT    $Z000JWX    tdm_arkcmp     
[$Z000LSB] 000,00013085 001 GEN  ES--A-- $Z000ANV    $Z0004CT    tdm_arkcmp     
[$Z000LSB] 001,00022612 001 DTM  ES--A-- $TM1        NONE        tm             
...

 

NID,PID show the node number and process ID. To translate the numeric node ID into a node name, use this command:

sqshell -c node

 

What do these processes do?

  • tm: Transaction manager, it is responsible for managing distributed transactions and for performing recovery.
  • mxosrvr: Connectivity server, this is the process JDBC/ODBC clients connect to. One mxosrvr handles one incoming connection, or it is idle.
  • tdm_arkcmp: SQL compiler, used in some cases to compile SQL queries in a separate process.

For more details, see Architecture. We are only concerned with troubleshooting problems on this page. 

When a client connects through JDBC/ODBC, it first contacts a master process on one of the cluster nodes and obtains the IP address and port of an available mxosrvr, anywhere on the Trafodion cluster. To be able to connect, confirm that these ports have been opened in the firewall. By default, the range of ports starts at 37800, and the size of the range depends on how many mxosrvrs are configured.
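
To verify reachability from a client machine, here is a hedged sketch (the node name and the size of the probed range are assumptions; 37800 is the default starting port mentioned above):

# probe the first few mxosrvr ports on a Trafodion node from the client side
for port in $(seq 37800 37807); do
  nc -z -w 2 <trafodion-node> $port && echo "port $port reachable"
done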

Another potential problem: A node may have multiple IP addresses, for example an internal network address and an external address. Since JDBC/ODBC connection requests get redirected to another node, DCS (Trafodion Database Connectivity Services) needs to give the correct IP address to the client. This is configured in the $DCS_INSTALL_DIR/conf/dcs-site.xml file, using property dcs.dns.interface.
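
A sketch of the corresponding entry (eth0 is an assumed interface name; use the interface whose address your clients can actually reach):

<property>
    <name>dcs.dns.interface</name>
    <value>eth0</value>
</property>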

 

Hive Table Access

A few things to consider:

  • Make sure that the Hive configuration directory (e.g. /etc/hive/conf) is in the class path
  • Try to connect manually to the Hive metastore, using the information from hive-site.xml
  • As the trafodion user, run the hive shell and try some simple commands
    sudo su - trafodion
    hive
      show tables;
      select count(*) from <some table>;
      quit;
    
  • Make sure that libhdfs is in the LD_LIBRARY_PATH (see the check after this list). This is usually set up correctly by default.
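
Here is a hedged sketch of that libhdfs check:

# look for libhdfs in each LD_LIBRARY_PATH directory
for dir in $(echo $LD_LIBRARY_PATH | tr ':' ' '); do
  ls $dir/libhdfs.so* 2>/dev/null
done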

Transactional Recovery

Trafodion implements the ability to recover transactions after an HBase crash. This feature is enabled by default.

Other scenarios, such as recovery from failure of individual RegionServers or the failure of the TM process, are not yet supported.

If you would like to disable recovery, set the following switch in the ms.env file and restart the Trafodion instance.

  1. Edit the $MY_SQROOT/etc/ms.env file on every node (see the sketch after this list) and set this variable:
    TM_ENABLE_TLOG_WRITES=0
    
  2. This additional property in the hbase-site.xml file also controls recovery but does not need to be changed.

    Property hbase.regionserver.region.transactional.tlog: This property controls the behavior of the regions during recovery.

    • true – regions will request the TM to redrive in-doubt transactions.
    • false – in-doubt transactions will be thrown away.
    Example entry:
    <property>
           <name>hbase.regionserver.region.transactional.tlog</name>
           <value>true</value>
    </property>
    

    This property is currently set to true.
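
Here is a hedged sketch of applying the ms.env change cluster-wide, using the $MY_NODES convention shown earlier on this page:

sudo su - trafodion
sqstop
# set TM_ENABLE_TLOG_WRITES=0 in the local copy, then push it to all nodes
vi $MY_SQROOT/etc/ms.env
pdcp $MY_NODES $MY_SQROOT/etc/ms.env $MY_SQROOT/etc
sqstart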

Cleaning Trafodion and HBase Tables

When a Trafodion table is created, configuration objects are created in both Trafodion and HBase. There may be times when these two sets of objects become inconsistent. This can result in the inability to drop and recreate the table in Trafodion. You might see Trafodion errors like the following when trying to drop objects:

SQL>drop table trafodion.seabase.table1;
*** ERROR[1389] Object TRAFODION.SEABASE.TABLE1 does not exist in Trafodion. [2014-06-10 14:12:51]

 

To recover from this problem, you need to ensure that all Trafodion and HBase configuration objects for the table have been removed. To remove a table from HBase, use the HBase shell to disable and drop the table.

hbase shell
  disable 'TRAFODION.SEABASE.TABLE1';
  drop 'TRAFODION.SEABASE.TABLE1';

You may also need to ensure that all Trafodion metadata has been removed. Query the Trafodion metadata to obtain the table's object_uid, then use the value returned by the select query in place of <num> in the delete statements below.

trafci
  select * from "_MD_".objects where object_name = 'TABLE1' and schema_name = 'SEABASE' and object_type = 'BT';

If no row is returned, the table configuration is not present in the Trafodion metadata, and you can exit trafci. If a row is returned, use the value of the object_uid column in the following delete statements. This removes all metadata configuration information for the table.

  begin work;
  set parserflags 131072;
  delete from "_MD_".objects where object_uid = <num>;
  delete from "_MD_".columns where object_uid = <num>;
  delete from "_MD_".keys where object_uid = <num>;
  delete from "_MD_".tables where table_uid = <num>;
  delete from "_MD_".table_constraints where table_uid = <num>;

 If all is well at this point, then:

  commit work;
  exit;

Emergency Recovery

There may be times when you would like to refresh the Trafodion instance and matching HBase objects. NOTE: This will destroy all tables that were created in Trafodion! You can perform a series of steps that will stop the Trafodion instance and then clean out the HBase and ZooKeeper objects. You may then restart the Trafodion instance and reinitialize the Trafodion objects.

  1. Bring down your Trafodion instance
    sudo su - trafodion
    # stop Trafodion
    sqstop 
    
  2. Invoke an HBase shell to clean out all TRAFODION HBase tables
    hbase shell
      disable_all 'TRAFODION.*'
      drop_all 'TRAFODION.*'
      exit
    
  3. Start a ZooKeeper shell and look for HBase entries. If there are any entries matching the Trafodion object naming convention, remove each individual entry with "rmr".
    /usr/lib/zookeeper/bin/zkCli.sh
      ls2 /hbase/table
      rmr /hbase/table/TRAFODION.<schema>.<table>
    
  4. Check HDFS for any remaining Trafodion objects
    • You need to run this using your own ID, as the trafodion user ID does not have sudo privilege.
      sudo su hdfs --command "hadoop dfs -ls /hbase"
      
    • Since you've already cleaned up all Trafodion tables from HBase, you shouldn't see any HDFS directories or files with a name matching the Trafodion naming convention, e.g. /hbase/TRAFODION.XXX.
    • If you do, remove them. Here is an example that removes all directories and files starting with TRAFODION.CMP:
       sudo su hdfs --command "hadoop dfs -rm -r /hbase/TRAFODION.CMP*"
      
    • Please delete only the names you recognize as being associated with Trafodion.
    • Once you have removed all lingering Trafodion files, repair the HBase metadata:
      sudo su hbase --command "hbase hbck -repair"
      
  5. When you believe everything is clean, restart the HBase, ZooKeeper, MapReduce and HDFS Hadoop services.
  6. Start your Trafodion instance:
    sudo su - trafodion
    # start Trafodion
    sqstart 
    
  7. Reinitialize your Trafodion instance:
    sudo su - trafodion
    sqlci | cat
      initialize trafodion;
      exit;
    

 

Online Node Reintegration

Online node reintegration brings a previously failed Trafodion node back into the cluster or instance without taking the instance down, whether the failure was hardware (HWR) or software (SWR). HWR node failures require that the node be repaired or replaced and then placed back into the original physical rack location. SWR node failures may be due to failure of the underlying Linux OS or due to fail-safe protection triggered by Trafodion. After the node is healthy and functional, ensure that the HBase environment on the node is also available. The final step is to reintegrate the node into the Trafodion instance. Currently, reintegration is a manually initiated operation.

In the following example, node "sea-dev4" is being reintegrated into the Trafodion instance:

[trafodion@sea-dev1 ~]$ sqshell -a
Processing cluster.conf on local host sea-dev1
[$Z000HAT] Shell/shell Version 1.0.1 Release 1.0.0 (Build debug [1.0.0rc2-6-ge7cad10_Bld72], date 20150125_083000)

[$Z000HAT] %node info
[$Z000HAT] Logical Nodes    = 6
[$Z000HAT] Physical Nodes   = 6
[$Z000HAT] Spare Nodes      = 0
[$Z000HAT] Available Spares = 0
[$Z000HAT] NID Type        State    Processors   #Procs
[$Z000HAT]     PNID        State        #Cores  MemFree SwapFree CacheFree Name
[$Z000HAT] --- ----------- -------- ---------- -------- -------- --------- --------
[$Z000HAT] 000 Any         Up                2        9
[$Z000HAT]     000         Up                8  6517656  4095996  10228088 sea-dev1
[$Z000HAT] 001 Any         Up                2       10
[$Z000HAT]     001         Up                8  6308828  4095992  11607900 sea-dev2
[$Z000HAT] 002 Any         Up                2        8
[$Z000HAT]     002         Up                8  1988592  3982764  11939032 sea-dev3
[$Z000HAT] 003 Any         Down    
[$Z000HAT]     003         Down                                            sea-dev4
[$Z000HAT] 004 Any         Up                2        8
[$Z000HAT]     004         Up                8  2081148  3832532  10454704 sea-dev5
[$Z000HAT] 005 Any         Up                2        8
[$Z000HAT]     005         Up                8   905744  3836724  10166456 sea-dev6

[$Z000HAT] %up sea-dev4
[$Z000HAT] 01/27/2015-15:58:48 - Node sea-dev4 is merging to existing cluster.
[$Z000HAT] 01/27/2015-15:58:49 - Node sea-dev4 join phase starting.
[$Z000HAT] %[$Z000HAT] 01/27/2015-15:58:54 - Node 3 (sea-dev4) is UP

[$Z000HAT] %[$Z000HAT] %node info
[$Z000HAT] Logical Nodes    = 6
[$Z000HAT] Physical Nodes   = 6
[$Z000HAT] Spare Nodes      = 0
[$Z000HAT] Available Spares = 0
[$Z000HAT] NID Type        State    Processors   #Procs
[$Z000HAT]     PNID        State        #Cores  MemFree SwapFree CacheFree Name
[$Z000HAT] --- ----------- -------- ---------- -------- -------- --------- --------
[$Z000HAT] 000 Any         Up                2        9
[$Z000HAT]     000         Up                8  6540244  4095996  10251084 sea-dev1
[$Z000HAT] 001 Any         Up                2       10
[$Z000HAT]     001         Up                8  6287700  4095992  11587408 sea-dev2
[$Z000HAT] 002 Any         Up                2        8
[$Z000HAT]     002         Up                8  2180728  3982764  11888216 sea-dev3
[$Z000HAT] 003 Any         Up                2        7
[$Z000HAT]     003         Up                8  1731684  3495860  11720212 sea-dev4
[$Z000HAT] 004 Any         Up                2        8
[$Z000HAT]     004         Up                8  2257312  3832532  10465852 sea-dev5
[$Z000HAT] 005 Any         Up                2        8
[$Z000HAT]     005         Up                8  1167184  3836724  10187724 sea-dev6
[$Z000HAT] %

 

Common Problems and Solutions

NOTE: If you do not see a problem listed here, see the known problems for the Trafodion release that you are using.

To troubleshoot problems related to data loading, see Troubleshooting Data Loading Problems.

 

Symptom: When trying to start Trafodion, you get the following loop:

 

# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....

 

Cause: DTM is not starting.

Solution: Common reasons and solutions:

  • HBase did not get restarted after trafodion_mods ran, so it did not pick up the setting modifications that the script applied to HBase. Possible reasons:
    • The hbase-trx jar file from $MY_SQROOT/export/lib that got copied to /usr/lib/hbase/lib does not exist or does not have the correct permissions, which should be 644.
    • The Hadoop Java version is different from what the java -version command shows. To determine the Hadoop Java version, use top, press c, and look for the Java processes and their version. In Trafodion Release 0.8.n, the Hadoop Java version is either 1.6 or 1.7. Starting in Trafodion Release 0.9.0, the minimum Java 7 version required is 1.7.0_65. In Trafodion Release 1.1.0, the minimum Java 7 version required is 1.7.0_67. To force Hadoop to use the right version, remove the unwanted Java version with yum -y remove.

  • There might actually be a DTM issue. Check $MY_SQROOT/sql/scripts/trafodion.dtm.log for "ERROR". If you see "ERROR" in the DTM log, please either file a bug or post a question in Launchpad. For more information on how bugs are tracked in Trafodion, see Bugs.

Symptoms: 

  • The sqlci prompt hangs during initial Trafodion startup, OR
  • You see an error message like the following during initialize trafodion:

SQL>initialize trafodion
+>;

*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::checkAndInsertRow returned error HBASE_ACCESS_ERROR(-706). Cause:
java.io.IOException: Problem with calling coprocessor, no regions returned result
org.apache.hadoop.hbase.client.transactional.TransactionalTable.checkAndPut(TransactionalTable.java:500)
org.apache.hadoop.hbase.client.transactional.RMInterface.checkAndPut(RMInterface.java:371)
org.trafodion.sql.HBaseAccess.HTableClient.putRow(HTableClient.java:1023)
org.trafodion.sql.HBaseAccess.HBaseClient.insertRow(HBaseClient.java:1490)
. [2015-09-23 23:31:53] *** ERROR[8839] Transaction was aborted. [2015-09-23 23:31:53]

Cause: There can be multiple potential issues for either symptom:

  • The Cloudera HBase installation has more than one RegionServer Group configured. There should be only one RegionServer Group, RegionServer(Default).
  • The stdout for sqlci is not being displayed. This problem may be a display problem only, and the Trafodion installation may be progressing normally.

Solution:

  • If the Cloudera HBase installation has more than one RegionServer Group configured, remove the extra HBase RegionServer Groups. This can explain the HBASE_ACCESS_ERROR message.
  • If stdout for sqlci is not being displayed, first wait several minutes to allow the Trafodion installation to complete. Then try steps 10 and 12 of the Trafodion Installation instructions. These steps include logging on to the VNC Server created for you at port 1, as user trafodion with password traf123. If you can log on successfully, start a terminal window and start a new sqlci session by entering sqlci at the command prompt. If sqlci starts successfully, execute the example in step 12 of the Trafodion Installation instructions.

Symptom: While running update statistics or count(*) commands, you see org.apache.hadoop.hbase.client.ScannerTimeoutException errors.

 

Cause: The default hbase.rpc.timeout and hbase.client.scanner.timeout.period values might be too low given the size of the tables.

 

Solution: Increase the hbase.rpc.timeout and hbase.client.scanner.timeout.period values. We have found that increasing those values to 600 seconds (10 minutes) will prevent many timeout-related errors. For more information, see the HBase Configuration and Fine Tuning Recommendations.
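
A sketch of the corresponding hbase-site.xml entries (note that both properties are specified in milliseconds, so 600 seconds is 600000):

<property>
    <name>hbase.rpc.timeout</name>
    <value>600000</value>
</property>
<property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>600000</value>
</property>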

Symptom: SQL queries fail with HBase OutOfOrderScannerNextException and WrongRegionException errors due to HBase regions being moved during region load balancing.

 

Cause: If HBase region load balancing is enabled for your installation and if regions are moved with active DTM transactions, there can be unpredictable results, as transactional context can be lost.

 

Solution: To avoid this problem, use the HBase shell command balance_switch false to disable the movement of a region from one server to another. For example:

 

# hbase shell
hbase(main):002:0> balance_switch false
true <-- Output will be the last setting of the balance_switch value
0 row(s) in 0.0080 seconds

Symptom: You see an ssh_exchange error when starting many client connections to the Trafodion system.

 

Cause: The settings for maximum sessions and maximum startups in OpenSSH are too low.

 

Solution: Modify the /etc/ssh/sshd_config file with these settings:

 

MaxSessions 100
MaxStartups 200

 

Then, restart the sshd service.
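
For example, on a RHEL/CentOS-style system (the exact command may differ on other distributions):

# apply the new MaxSessions/MaxStartups limits
sudo service sshd restart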

