Today I have a requirement to implement Hive disaster recovery: I have to take a backup of the Hive metastore and all Hive DBs. Our development team uses several databases, all of which reside under /user/hive/warehouse in HDFS.
To take a backup of a Hive database, we can follow the official Cloudera documentation given in this link.
Before moving ahead, let's understand a few things:
“Hive/Impala replication enables you to copy (replicate) your Hive metastore and data from one cluster to another and synchronize the Hive metastore and data set on the destination cluster with the source, based on a specified replication schedule. The destination cluster must be managed by the Cloudera Manager Server where the replication is being set up, and the source cluster can be managed by that same server or by a peer Cloudera Manager Server”
I don’t have a separate BDR cluster, so I have to move my backup to S3.
Prerequisites
- Backup to and restore from Amazon S3 is supported from Cloudera Manager (CM) 5.9 and CDH 5.9 onwards.
- When Hive data is backed up to Amazon S3 from a given CDH version, the same data can only be restored to that same CDH version.
Step-1: Set up AWS Credentials
As my cluster is provisioned on EC2 instances using IAM role-based authentication, I don’t need to do anything extra to configure this.
“If you are configuring Amazon S3 access for a cluster deployed to Amazon Elastic Compute Cloud (EC2) instances using the IAM role for the EC2 instance profile, you do not need to configure IAM role-based authentication for services such as Impala, Hive, or Spark.”
Just go to CM > Administration and click on IAM Role-based Authentication.
Click Add. On the next screen you will see the S3Guard setting; we can skip this, as we don’t need it at this time.
Click Save, and you will see the configuration below after enabling the cluster for S3 access. This is how we configure AWS credentials.
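To sanity-check that the cluster can actually reach S3 through the IAM role, a quick listing from any gateway node should succeed. This is just a sketch; the bucket name below is hypothetical, so replace it with your own:
hadoop fs -ls s3a://my-dr-bucket/
If this returns without an access error, the S3A connector is picking up the instance-profile credentials and we can move on to the replication setup.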
Step-2: Configure Hive/Impala replication to or from S3
- Select Backup > Replication Schedules.
- Click Create Schedule > Hive Replication.
- To back up data to S3:
- Select the Source cluster from the Source drop-down list.
- Select the S3 or ADLS destination (one of the AWS Credentials or ADLS Credentials you created) from the Destination drop-down list.
- Enter the path in S3 where the data should be copied to. For S3, use the following form (see the example path after this list):
s3a://S3_bucket_name/path
- Select one of the following Replication Options:
- Metadata and Data – Backs up the Hive data from HDFS and its associated metadata.
- Metadata only – Backs up only the Hive metadata.
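For example, assuming a hypothetical bucket named my-dr-bucket and a hive-backup prefix for the production databases, the destination path could look like this:
s3a://my-dr-bucket/hive-backup/prod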
The following screenshot will give more clarity:
We can schedule it, or we can run an immediate replication as shown below. Once it’s done, we can check the progress of the replication in the logs.
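If you prefer the command line over the CM UI, the replication schedules can also be inspected through the Cloudera Manager REST API. The call below is only a sketch under several assumptions: the API host cm-host:7180, the admin/admin login, the cluster name Cluster1, the Hive service name hive, and the API version are all placeholders that will differ in your environment:
curl -u admin:admin 'http://cm-host:7180/api/v14/clusters/Cluster1/services/hive/replications'
The response lists the configured replication schedules along with the status of their most recent runs.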
How to check the backup on S3?
Through HDFS
hadoop fs -ls s3a://S3_bucket_name/path
Through AWS CLI
aws s3 ls s3://S3_bucket_name/path
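To get a quick feel for how much data actually landed in the bucket, a recursive listing with a size summary can help (bucket name and path are placeholders):
aws s3 ls s3://S3_bucket_name/path --recursive --human-readable --summarize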
Basic commands for S3
Delete files from S3: I want to delete samplefile.txt from the S3 bucket
hadoop fs -rm -skipTrash s3a://S3_bucket_name/samplefile.txt
Delete directories: I want to delete the test1 directory from the S3 bucket
hadoop fs -rm -r -skipTrash s3a://S3_bucket_name/test1
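The same cleanup can also be done with the AWS CLI instead of hadoop fs; these are the straightforward equivalents of the two examples above (bucket name and paths are placeholders):
aws s3 rm s3://S3_bucket_name/samplefile.txt
aws s3 rm s3://S3_bucket_name/test1 --recursive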