How to resolve and troubleshoot Hive DB lock issues in production
Issue Description: A job is failing or running very slowly in the production environment.
Error in logs: Checking the application log, the failed job shows one of the following error messages:
java.lang.RuntimeException: Error while compiling statement: FAILED: Error in acquiring locks: Error communicating with the metastore
or
java.lang.RuntimeException: Could not acquire lock
Troubleshooting:
Step 1: We are using a PostgreSQL database, so connect to the host and log in to the Hive metastore database.
$ sudo su - postgres
$ psql
postgres=# \c metastore    # switch to the Hive metastore DB
Step 2: Execute the following queries to collect lock details.
metastore=# select count(1) from "HIVE_LOCKS";
This shows the total number of locks currently held in the metastore.
metastore=# SELECT COUNT(1) FROM "TBL_PRIVS";
If the value is more than 600, it means SELECT, INSERT, DELETE, and UPDATE operations are exceeding the limit.
metastore=# SELECT * FROM "TBL_COL_PRIVS";
Check the value this query returns; ideally it should be 0.
metastore=# SELECT COUNT(1) FROM "COMPACTION_QUEUE";
This shows the size of the compaction queue. If it is huge, we should control auto compaction.
metastore=# SELECT COUNT(1) FROM "HIVE_LOCKS" WHERE "HL_LAST_HEARTBEAT" = 0;
If this value is high, it means locks are not being released (their heartbeats have stopped).
metastore=# SELECT COUNT(*) FROM "HIVE_LOCKS" WHERE "HL_TABLE" LIKE '%study%';
If you know which table is affected, check the locks held on that table (replace 'study' with the table name).
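The diagnostic queries above can be collected in one pass. The sketch below assumes shell access as the postgres user and the metastore database from Step 1; QUERY_RUNNER is a hypothetical indirection so the function can be exercised without a live metastore.

```shell
#!/bin/sh
# Sketch: run the Step 2 diagnostic queries non-interactively via psql.
# QUERY_RUNNER is an assumed indirection; by default it calls psql with
# -A (unaligned) and -t (tuples only) so each query returns a bare number.
QUERY_RUNNER="${QUERY_RUNNER:-psql -d metastore -A -t -c}"

count() {
  # intentional word splitting of the runner command
  $QUERY_RUNNER "$1"
}

collect_lock_stats() {
  echo "HIVE_LOCKS:       $(count 'SELECT COUNT(1) FROM "HIVE_LOCKS";')"
  echo "TBL_PRIVS:        $(count 'SELECT COUNT(1) FROM "TBL_PRIVS";')"
  echo "COMPACTION_QUEUE: $(count 'SELECT COUNT(1) FROM "COMPACTION_QUEUE";')"
  echo "stale heartbeats: $(count 'SELECT COUNT(1) FROM "HIVE_LOCKS" WHERE "HL_LAST_HEARTBEAT" = 0;')"
}

# Usage (on the metastore host): collect_lock_stats
```

Running it on the metastore host prints one labeled count per line, matching the individual queries above.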
Step 3: Execute the following commands in Hue and check the status of the DB locks:
# SHOW LOCKS;
# SHOW LOCKS EXTENDED;
# SHOW TRANSACTIONS;
The above analysis gives us the lock details. A normal lock count is roughly up to ~1,500; if it is much higher (on the order of 10,000), we need to tune the job or the server configuration.
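The ~1,500 "normal" figure above can be turned into a simple threshold guard. A minimal sketch, where check_locks is a hypothetical helper and the count would normally come from the HIVE_LOCKS query in Step 2:

```shell
#!/bin/sh
# Sketch: flag when the HIVE_LOCKS count exceeds a threshold (default 1500,
# the rough "normal" upper bound noted above). The count argument would
# normally come from: psql -d metastore -A -t -c 'SELECT COUNT(1) FROM "HIVE_LOCKS";'
check_locks() {
  count="$1"
  threshold="${2:-1500}"
  if [ "$count" -gt "$threshold" ]; then
    echo "WARN: $count locks held (> $threshold); tune the job or server configuration"
  else
    echo "OK: $count locks held"
  fi
}

check_locks 2300   # example with a high count
```

Wiring this into a cron job against the metastore would give early warning before jobs start failing on lock acquisition.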
Use Case 1: In this use case we found that a Hive MERGE statement was trying to acquire exclusive locks on a very large number of partitions (~1,386).
This request times out, likely due to the very high number of partitions. This statement may require tuning that falls outside support scope to run efficiently.
If the developer can modify their statement to avoid this, that would be the best solution.
Temporary fix: Increasing the following timeouts may allow the statement to run, but this only unblocks the application; it is not a permanent solution:
- Navigate to CM -> Hive on Tez -> Configuration -> Hive Metastore Connection Timeout
- Increase to 20 minutes or more
- Save changes
- Navigate to CM -> Hive -> Configuration -> edit “Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml”
Add Property:
- Name: hive.metastore.client.socket.timeout
- Value: 1200 (seconds, i.e. 20 minutes)
- Save changes
- Restart Hive and Hive on Tez
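For reference, the safety-valve entry above corresponds to the following hive-site.xml fragment (the value is in seconds; 1200 s = 20 minutes):

```xml
<!-- hive-site.xml safety-valve fragment: raise the metastore client socket
     timeout so the lock-acquisition request has time to complete -->
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1200</value>
</property>
```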
Use Case 2: If there is no issue with the Hive statement itself, follow the steps below to tune the Hive server configuration.
Initiating automatic compaction should only be done by a single HMS as detailed in Cloudera's official documentation.
Log into Cloudera Manager, select the ‘HIVE’ service, and click the Configuration tab.
- Disable the compactor and housekeeping in the global configuration and enable them on a single instance only. Under the Configuration tab, search for "hive.compactor.initiator.on", deselect the option, and save the change. Do the same for "hive.metastore.housekeeping.threads.on". Also make sure to remove these properties from the Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml.
- Next, go to the Instances tab and select the Hive Metastore instance that will handle the compaction task. This takes you to that HMS instance's page. From there, select the Configuration tab and search again for the "hive.compactor.initiator.on" property. It should be labeled "Hive Metastore Server (<hostname here>)", showing that the change will apply only to this single instance and not at the service level. Select (enable) the option here, then on the same instance repeat the process for "hive.metastore.housekeeping.threads.on".
- Add the parameter hive.server2.tez.initialize.default.sessions to the Hive on Tez configuration:
- Navigate to CM -> Hive on Tez -> Configuration -> Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml and add the following property:
- Key = hive.server2.tez.initialize.default.sessions
- Value = false
- Deploy Client configuration
- Restart HMS and HS2.
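Similarly, the Hive on Tez safety-valve property above maps to this hive-site.xml fragment:

```xml
<!-- hive-site.xml safety-valve fragment (Hive on Tez): do not pre-initialize
     default Tez sessions at HiveServer2 startup -->
<property>
  <name>hive.server2.tez.initialize.default.sessions</name>
  <value>false</value>
</property>
```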
Finally, restart the job and monitor the locks; the issue should now be resolved.