No. | JIRA | Affected Version(s) | Fix Version | Issue | Steps to work around this issue
No. 14 | JIRA: - | Affected Version(s): 2.4.x+ | Fix Version: -

HiveServer2 can send a very large number of metrics, which causes performance issues with the Metrics Collector.

Symptoms

  • AMS goes down intermittently.
  • AMS displays only host metrics (Host summary page on Ambari / System - Servers Grafana dashboard).
  • Aggregated data is not seen (AMS Summary page / System - Home Grafana dashboard / HBase - Home Grafana dashboard).

How do you find out if this is the issue?

  • Check the AMS metadata endpoint: http://<ams-host>:6188/ws/v1/timeline/metrics/metadata.
  • Use a JSON viewer to inspect the JSON response; it shows the number of metrics coming from each component. AMS can handle around 10,000 unique metrics. Check whether hiveserver2, or any other component's metric count, is causing an explosion of metrics (a command-line sketch of the same check follows this list).
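A minimal sketch of the same check from the command line, assuming curl and Python 3 are available and substituting the real collector host for <ams-host> (the response is a JSON map of component name to its list of metrics, as described above):

    curl -s "http://<ams-host>:6188/ws/v1/timeline/metrics/metadata" > /tmp/ams-metadata.json
    # Print each component with its metric count, largest first.
    python -c "import json; d = json.load(open('/tmp/ams-metadata.json')); [print(k, len(v)) for k, v in sorted(d.items(), key=lambda kv: -len(kv[1]))]"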

 

Steps to work around this issue:

  • Try increasing the heap settings for the Metrics Collector and AMS HBase (Configurations - Tuning), since the system might be tuned for a small cluster while in fact receiving a lot of metrics.
  • Set (a command-line sketch follows this list):
    • ams-site : timeline.metrics.service.resultset.fetchSize = 10000
    • ams-hbase-site : hbase.regionserver.handler.count = 30
  • If on Ambari 2.5.x, set ams-site : timeline.metrics.cluster.aggregation.sql.filters = sdisk_%,boottime,default.General%
  • Whitelisting:
    • In Ambari 2.4.x the whitelisting feature is quite limited (only a metric whitelist file can be used).
    • In Ambari 2.5.0, however, many refinements were added to whitelisting.
    • See Ambari Metrics - Whitelisting.
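These properties can also be set from the command line with the configs.sh helper shipped with Ambari Server (a sketch, not part of the original steps; substitute real credentials, the Ambari Server host, and the cluster name):

    cd /var/lib/ambari-server/resources/scripts
    # Usage: configs.sh [-u user] [-p password] set <ambari-host> <cluster> <config-type> <key> <value>
    ./configs.sh -u admin -p admin set <ambari-host> <cluster> ams-site "timeline.metrics.service.resultset.fetchSize" "10000"
    ./configs.sh -u admin -p admin set <ambari-host> <cluster> ams-hbase-site "hbase.regionserver.handler.count" "30"

Restart the Metrics Collector afterwards for the changes to take effect.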

 

No. 13 | JIRA: AMBARI-20056 | Affected Version(s): 2.2.2 | Fix Version: 2.5.0

On large clusters, AMS can become inoperable due to store-file explosion and lack of compaction.

 

Consequence: a large number of store files (~10,000) in AMS HBase, and AMS shutting down regularly.

Steps to work around this issue:
  1. Set ams-site : timeline.metrics.hbase.fifo.compaction.enabled = false
  2. Connect to the HBase shell on the Metrics Collector host:
    /usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf shell
  3. Execute the following statement:
alter 'METRIC_RECORD', CONFIGURATION => {'hbase.hstore.blockingStoreFiles' => '1000', 'hbase.hstore.defaultengine.compactionpolicy.class' => 'org.apache.hadoop.hbase.regionserver.compactions.FIFOCompactionPolicy'}
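To confirm the change took effect, the table descriptor can be inspected from the same shell (a quick verification, not part of the original steps):

    describe 'METRIC_RECORD'
    # The CONFIGURATION section of the output should now show
    # hbase.hstore.blockingStoreFiles => '1000' and the compaction policy class.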


If the above does not solve the issue, the only way to recover the system is to reset the metric system.

No. 12 | JIRA: AMBARI-18093 | Affected Version(s): 2.2.2 | Fix Version: 2.4.0

On large clusters, if the TTL of the high-precision tables is more than 3 days, it leads to too much data and too many regions in AMS HBase. It is better to have a smaller TTL for the high-precision data. The 5-minute aggregate data will still be available for 7 days to work with.

Steps to work around this issue:
  1. Make the following config changes in "ams-site" from the UI (the equivalent values in seconds are noted after these steps):

    timeline.metrics.host.aggregator.ttl : 1 day
    timeline.metrics.cluster.aggregator.second.ttl : 3 days

  2. Restart AMS collector
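Note: these TTL properties take their values in seconds (1 day = 86400 s, 3 days = 259200 s), so the day values above correspond to:

    ams-site : timeline.metrics.host.aggregator.ttl = 86400
    ams-site : timeline.metrics.cluster.aggregator.second.ttl = 259200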
No. 11 | JIRA: AMBARI-17779 | Affected Version(s): 2.2.1, 2.2.2 | Fix Version: 2.4.0

The HBase normalizer, which automatically splits/merges regions based on region size, was leveraged in AMS (2.2.1). However, due to occasionally over-aggressive region splitting by the normalizer, on large clusters this could lead to an explosion of regions, eventually causing AMS to crash every time it starts up.

As of 2.2.2, the AMS HBase normalizer cannot be disabled through AMS configs.

 

Instructions for disabling the normalizer on AMS HBase tables:

1. su ams (kinit if needed)
2. Connect to the HBase shell on the Metrics Collector host:
/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf shell
3. Execute the following statements one by one.

 

 alter 'METRIC_RECORD', {NORMALIZATION_ENABLED => 'false'} 
 alter 'METRIC_AGGREGATE', {NORMALIZATION_ENABLED => 'false'}
 alter 'METRIC_RECORD_MINUTE', {NORMALIZATION_ENABLED => 'false'} 
 alter 'METRIC_AGGREGATE_MINUTE', {NORMALIZATION_ENABLED => 'false'}
 alter 'METRIC_RECORD_HOURLY', {NORMALIZATION_ENABLED => 'false'}
 alter 'METRIC_AGGREGATE_HOURLY', {NORMALIZATION_ENABLED => 'false'}
 alter 'METRIC_RECORD_DAILY', {NORMALIZATION_ENABLED => 'false'}
 alter 'METRIC_AGGREGATE_DAILY', {NORMALIZATION_ENABLED => 'false'}
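Equivalently, all eight statements can be piped into the shell in one pass (a sketch; it relies on the hbase shell reading statements from stdin and uses the same table names as above):

    for t in METRIC_RECORD METRIC_AGGREGATE METRIC_RECORD_MINUTE METRIC_AGGREGATE_MINUTE \
             METRIC_RECORD_HOURLY METRIC_AGGREGATE_HOURLY METRIC_RECORD_DAILY METRIC_AGGREGATE_DAILY; do
      echo "alter '$t', {NORMALIZATION_ENABLED => 'false'}"
    done | /usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf shell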

 

4. Verify the configuration change took effect: open the HBase Master UI in a browser (http://<collector_host>:61310) and search for the string "NORMALIZATION". It should return no matches.

No. 10 | JIRA: AMBARI-15492 | Affected Version(s): 2.2.1 | Fix Version: 2.2.2

The Ambari Metrics Collector shuts down and restarts randomly. The following error is seen in the collector log.

ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: RECEIVED SIGNAL 15: SIGTERM

 
  • Comment out these two properties in /etc/ambari-server/conf/ambari.properties (a sed sketch follows this list):
    • #recovery.enabled_components=METRICS_COLLECTOR
    • #recovery.type=AUTO_START
  • Restart Ambari Server.
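The same edit can be scripted (a sketch assuming GNU sed and the default ambari.properties path):

    # Prefix both recovery properties with '#' to comment them out.
    sed -i -e 's/^recovery.enabled_components=METRICS_COLLECTOR/#&/' \
           -e 's/^recovery.type=AUTO_START/#&/' \
           /etc/ambari-server/conf/ambari.properties
    ambari-server restart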
No. 9 | JIRA: AMBARI-13758 | Affected Version(s): 2.1.2 and lower | Fix Version: 2.2.0

When the Ambari Metrics Collector is moved from one host to another, host metrics are not seen.
  • Restart Ambari Server.
  • Restart the Ambari Metrics Monitor.
No. 8 | JIRA: AMBARI-14257 | Affected Version(s): 2.1.2 | Fix Version: 2.2.0

Storm metrics are not seen after upgrading to Ambari 2.1.2.

On every host with a Storm component (nimbus / supervisor / client), carry out the following steps (a scripted sketch follows the list).

  1. Verify the broken link:
    ls -al /usr/hdp/current/storm-<component>/lib/ambari-metrics-storm-sink.jar
  2. Remove the symlink:
    rm -f /usr/hdp/current/storm-<component>/lib/ambari-metrics-storm-sink.jar
  3. Reattach the symlink to the new JAR:
    ln -s /usr/lib/storm/lib/ambari-metrics-storm-sink*.jar /usr/hdp/current/storm-<component>/lib/ambari-metrics-storm-sink.jar
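If a host runs several Storm components, the three steps can be scripted across all of them (a sketch; it assumes the component directories match /usr/hdp/current/storm-* and that exactly one sink JAR exists under /usr/lib/storm/lib):

    for d in /usr/hdp/current/storm-*/lib; do
      rm -f "$d/ambari-metrics-storm-sink.jar"     # remove the broken symlink
      ln -s /usr/lib/storm/lib/ambari-metrics-storm-sink*.jar "$d/ambari-metrics-storm-sink.jar"
    done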
No. 7 | JIRA: AMBARI-13798 | Affected Version(s): 2.1.2 | Fix Version: 2.2.0

Ambari Metrics service graphs might not show data for certain metrics, and the following error might be seen in the Metrics Collector log (Ambari 2.1.2, 2.1.2.1).

"The time range query for precision table exceeds row count limit, please query aggregate table instead"

  • In the ams-site configuration, set timeline.metrics.service.default.result.limit = 15840 (a command-line sketch follows this list).
  • Restart the Collector.
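As with issue 14 above, the property can also be applied with the configs.sh helper (a sketch; substitute real credentials, the Ambari Server host, and the cluster name):

    /var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin set \
        <ambari-host> <cluster> ams-site "timeline.metrics.service.default.result.limit" "15840"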
No. 6 | JIRA: AMBARI-13711 | Affected Version(s): 2.1.2 | Fix Version: -

Ambari Metrics Server won't start successfully with Kerberos in distributed mode (AMBARI-13711).

The problem is that we cannot have separate principals for the HBase Master and RegionServer: ZooKeeper ACLs will not allow a znode created with one principal to be read by the other unless proper ACLs are set.

Since in 2.1.2 the Master creates the znode with a different principal than the RegionServer, this issue occurs.

Change the AMS configuration to use the Master keytab and principal for the RegionServer (a keytab check is sketched after the restart step below). Set:

  • ams-hbase-site ::: hbase.regionserver.keytab.file = /etc/security/keytabs/ams-hbase.master.keytab
  • ams-hbase-site ::: hbase.regionserver.kerberos.principal = amshbasemaster/_HOST@REALM

Restart the Collector.
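Before reusing the Master keytab for the RegionServer, it helps to confirm which principal the keytab actually holds (a quick check, not part of the original steps; the keytab path is the one set above):

    klist -kt /etc/security/keytabs/ams-hbase.master.keytab
    # The principal listed here is the value to use for
    # hbase.regionserver.kerberos.principal (with _HOST and REALM substituted).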

No. 5 | JIRA: - | Affected Version(s): - | Fix Version: -

Metrics data for the last month is missing many data points that should exist.

- Check /var/log/ambari-metrics-collector/ambari-metrics-collector.log for metrics-data aggregation errors such as OutOfOrderScannerNextException or SpoolTooBigToDiskException (a grep sketch follows this list).

- Set a bigger value for the hbase_regionserver_heapsize property in Advanced ams-hbase-env using the Ambari Web UI.

- Restart the Metrics Collector.
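A quick way to scan the collector log for the aggregation errors named in the first step (log path as given above):

    grep -E 'OutOfOrderScannerNextException|SpoolTooBigToDiskException' \
        /var/log/ambari-metrics-collector/ambari-metrics-collector.log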

No. 4 | JIRA: AMBARI-11501, AMBARI-12347 | Affected Version(s): 2.0.x and 2.1.0 | Fix Version: -

AMS HBase does not start after Kerberization in distributed mode in Ambari 2.0.x and 2.1.0. (Note: also see issue 1.)

Issue 1 (AMBARI-11501)
Steps to work around this issue:

On the Ambari Server host:

  • Edit /var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/params.py and point hbase_staging_dir to the desired location on HDFS, e.g. "/ams-hbase/staging"
  • Restart Ambari server.
  • Restart Metrics Collector.

Issue 2 (AMBARI-12347)
Steps to work around this issue:

ams.zookeeper.principal = zookeeper/_HOST@EXAMPLE.COM (substitute the appropriate REALM)

ams.zookeeper.keytab = /etc/security/keytabs/zk.service.keytab

Note: this assumes you have a ZooKeeper keytab on the host running the Metrics Collector. If not, you should create one with appropriate permissions.

If a keytab already exists, make sure to chmod 440 /etc/security/keytabs/zk.service.keytab.

Example:

# klist -kt /etc/security/keytabs/zk.service.keytab
Keytab name: FILE:/etc/security/keytabs/zk.service.keytab
KVNO Timestamp         Principal
---- ----------------- --------------------------------------------------------
   1 07/08/15 22:29:07 zookeeper/ambari-sid-3.c.pramod-thangali.internal@EXAMPLE.COM
  • Restart Metrics Collector
No. 3 | JIRA: - | Affected Version(s): 2.0.0, 2.1.0, 2.1.1 | Fix Version: 2.1.2

Altering TTL is not supported by the version of Phoenix used with Ambari 2.0.0, 2.1.0, and 2.1.1. The TTL property can instead be modified with HBase shell commands, as below (172800 seconds = 2 days).
~]$ su - ams
~]$ export JAVA_HOME=/usr/jdk64/jdk1.8.0_40/
~]$ /usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf shell

hbase(main):007:0> describe 'METRIC_RECORD'
------- HBase output describing table information ---------------
hbase(main):009:0> alter 'METRIC_RECORD', { NAME => '0', TTL => 172800}
hbase(main):007:0> describe 'METRIC_RECORD'
------- HBase output describing table information with new TTL ---------------
No. 2 | JIRA: - | Affected Version(s): 2.0.x | Fix Version: 2.1.0

Ambari Metrics service does not work after enabling security with AMS in distributed mode in Ambari 2.0.x.
  1. Copy core-site.xml and hdfs-site.xml to the Metrics Collector config directories after HA is enabled:

    cp /etc/hadoop/conf/core-site.xml /etc/ambari-metrics-collector/conf/
    cp /etc/hadoop/conf/hdfs-site.xml /etc/ambari-metrics-collector/conf/
    cp /etc/hadoop/conf/core-site.xml /etc/ams-hbase/conf/
    cp /etc/hadoop/conf/hdfs-site.xml /etc/ams-hbase/conf/
  2. Restart the Metrics Collector Component.

No. 1 | JIRA: AMBARI-10707 | Affected Version(s): 2.0.x | Fix Version: 2.1.0

Ambari Metrics service does not work with NameNode HA in distributed mode in Ambari 2.0.x.


1 Comment

  1. In the case of embedded mode in a Kerberized environment, we should not have hdfs-site.xml and core-site.xml in /etc/ams-hbase/conf and /etc/ambari-metrics-server/conf. If they are present, AMS considers the system to be Kerberos-authenticated and looks for a keytab, but none is currently required for AMS in embedded mode. In such a situation, starting the AMS Collector fails as follows:

    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2979) 

    Caused by: java.io.IOException: Running in secure mode, but config doesn't have a keytab 

    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:236) 

    Removing the hdfs-site.xml and core-site.xml from the AMS config locations resolves this issue (a cleanup sketch follows).
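A minimal sketch of that cleanup (it moves the files aside rather than deleting them; the collector config directory name may differ by version, e.g. /etc/ambari-metrics-collector/conf):

    mkdir -p /tmp/ams-conf-backup
    mv /etc/ams-hbase/conf/hdfs-site.xml /etc/ams-hbase/conf/core-site.xml /tmp/ams-conf-backup/
    mv /etc/ambari-metrics-server/conf/hdfs-site.xml /etc/ambari-metrics-server/conf/core-site.xml /tmp/ams-conf-backup/

Restart the Metrics Collector afterwards.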