The principle of HBase BulkLoad and writing data in batches in practice

Table of contents

1. Introduction to HBase BulkLoad

  1. Introduction
  2. Why use BulkLoad to import data?
  3. The implementation principle of BulkLoad

2. Writing data in batches with HBase BulkLoad

  1. Develop code to generate HFile files
  2. Package into a jar and submit it to the cluster to run
  3. Observe the output results on HDFS
  4. Load the HFile files into the HBase table
  5. Summary

1. Introduction to HBase BulkLoad

1. Introduction

We introduced the HBase storage mechanism earlier: at the bottom layer, HBase uses HDFS as its storage medium. Each HBase table corresponds to a directory on HDFS named after the table (tables created without a namespace live under the default directory). Under the table directory there is one sub-directory per region, and inside each region directory each column family again has its own directory. The actual data lives in those column family directories, stored as HFile files.

Path format: /hbase/data/default/<tbl_name>/<region_id>/<cf_name>/<hfile_id>
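
For example, for the myuser2 table and its column family f1 used later in this article, an HFile sits under a path of the form /hbase/data/default/myuser2/<region_id>/f1/<hfile_id>.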

2. Why use BulkLoad to import data?

There are many ways to batch load data into an HBase cluster: writing in batches through the HBase API, importing with the Sqoop tool, importing with MapReduce, and so on. With all of these methods, when the data volume is large the import can take a long time and consume a lot of HBase cluster resources (disk IO, HBase handler threads, and so on), because every write still goes through the normal write path. BulkLoad avoids most of this overhead by writing HFiles directly instead of sending each row through the RegionServer write path (WAL and MemStore).

3. The implementation principle of BulkLoad

Since HBase ultimately stores its data on HDFS as HFiles, BulkLoad uses MapReduce to generate data files directly in HFile format and then has the RegionServer move those HFiles into the corresponding regions.

  • Normal HBase write path (schematic diagram)
  • BulkLoad write path (schematic diagram)
  • Benefits of BulkLoad:
  1. The import process does not occupy region resources
  2. Massive amounts of data can be imported quickly
  3. It saves memory

2. Writing data in batches with HBase BulkLoad

Requirement

Use BulkLoad to convert the data file /hbase/input/user.txt that we have placed on HDFS into HFile format, and then load it into the HBase table myuser2.
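
For reference, the input file is assumed to contain one record per line with three tab-separated fields: rowkey, name and age. The sample values below are made up for illustration:

0001	zhangsan	18
0002	lisi	25
0003	wangwu	30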

1. Develop code to generate HFile files

Custom Mapper class

package xsluo.hbase;
 
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
 
//Of the four generic type parameters, the last two correspond to the output rowkey and the Put object
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
 
        //Wrap the rowkey as the output key
        ImmutableBytesWritable immutableBytesWritable = new ImmutableBytesWritable(split[0].getBytes());
        //Build the Put object
        Put put = new Put(split[0].getBytes());
        put.addColumn("f1".getBytes(), "name".getBytes(), split[1].getBytes());
        put.addColumn("f1".getBytes(), "age".getBytes(), split[2].getBytes());
 
        context.write(immutableBytesWritable, put);
    }
}

Driver class (main program)

package xsluo.hbase;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
 
public class HBaseBulkLoad extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        //Set the ZooKeeper quorum for the HBase cluster
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
 
        int run = ToolRunner.run(configuration, new HBaseBulkLoad(), args);
        System.exit(run);
    }
 
    @Override
    public int run(String[] strings) throws Exception {
        Configuration conf = super.getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(HBaseBulkLoad.class);
 
        FileInputFormat.addInputPath(job, new Path("hdfs://node01:8020/hbase/input"));
        job.setMapperClass(BulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
 
        //Get database connection
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("myuser2"));
 
        //Configure the job for an incremental (bulk) load into the myuser2 table
        HFileOutputFormat2.configureIncrementalLoad(job, table, connection.getRegionLocator(TableName.valueOf("myuser2")));
        //The data is written back to HDFS as HFiles, so the output format is HFileOutputFormat2
        job.setOutputFormatClass(HFileOutputFormat2.class);
        HFileOutputFormat2.setOutputPath(job, new Path("hdfs://node01:8020/hbase/out_hfile"));
 
        //start execution
        boolean b = job.waitForCompletion(true);
 
        return b?0:1;
    }
}
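
Note that HFileOutputFormat2.configureIncrementalLoad does more than wire the job to the table: it also sets the output format, installs the TotalOrderPartitioner, chooses a sort reducer that matches the map output value type (here Put), and sets the number of reduce tasks to the number of regions, so that rows reach the output format sorted and partitioned by region. The explicit setOutputFormatClass call above is therefore redundant but harmless.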
2. Package into a jar and submit it to the cluster to run
hadoop jar hbase_demo-1.0-SNAPSHOT.jar xsluo.hbase.HBaseBulkLoad
3. Observe the output results on HDFS
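
The generated output can be checked by listing the output directory on HDFS (using the output path assumed above); it should contain a _SUCCESS marker and one sub-directory per column family, here f1, holding the generated HFiles:

hdfs dfs -ls -R /hbase/out_hfile
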
4. Load the HFile files into the HBase table

Method 1: Load via code

package xsluo.hbase;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
 
public class LoadData {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01,node02,node03");
        //Get database connection
        Connection connection =  ConnectionFactory.createConnection(configuration);
        //Get the Admin object for the cluster
        Admin admin = connection.getAdmin();
        //Get the table object
        TableName tableName = TableName.valueOf("myuser2");
        Table table = connection.getTable(tableName);
        //Build LoadIncrementalHFiles to load HFile files
        LoadIncrementalHFiles load = new LoadIncrementalHFiles(configuration);
        load.doBulkLoad(new Path("hdfs://node01:8020/hbase/out_hfile"), admin,table,connection.getRegionLocator(tableName));
    }
}

Method 2: Load via the command line

First, add the HBase jars to the Hadoop classpath:

export HBASE_HOME=/xsluo/install/hbase-1.2.0-cdh5.14.2/
export HADOOP_HOME=/xsluo/install/hadoop-2.6.0-cdh5.14.2/
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`
  • Run the command:
yarn jar /xsluo/install/hbase-1.2.0-cdh5.14.2/lib/hbase-server-1.2.0-cdh5.14.2.jar   completebulkload /hbase/out_hfile myuser2

After loading, the generated HFile files are moved on HDFS into the column family directory of the corresponding region of the table.
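
A quick way to confirm that the data is now in the table, for example, is to scan it from the HBase shell:

hbase shell
scan 'myuser2', {LIMIT => 5}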

5. Summary

To make the demonstration clearer, this article splits BulkLoad into two separate steps: generating the HFile files and importing the HFiles into the HBase cluster. In practice the two steps can be combined into one, so that HFiles are generated and imported automatically (see the sketch below). If an RpcRetryingCaller exception occurs during execution, check the logs on the corresponding RegionServer node; they record the detailed cause of the exception.
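
As noted, the two steps can be combined into one driver. Below is a minimal sketch of such a combined program, reusing the BulkLoadMapper, paths and table name from this example; the class name HBaseBulkLoadAndImport is made up here for illustration:

package xsluo.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

//Generates the HFiles and bulk-loads them into the table in a single run
public class HBaseBulkLoadAndImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");

        Path input = new Path("hdfs://node01:8020/hbase/input");
        Path output = new Path("hdfs://node01:8020/hbase/out_hfile");
        TableName tableName = TableName.valueOf("myuser2");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             Admin admin = connection.getAdmin()) {

            //Step 1: generate the HFiles with the same Mapper as before
            Job job = Job.getInstance(conf, "bulkload-myuser2");
            job.setJarByClass(HBaseBulkLoadAndImport.class);
            job.setMapperClass(BulkLoadMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, input);
            HFileOutputFormat2.configureIncrementalLoad(job, table,
                    connection.getRegionLocator(tableName));
            HFileOutputFormat2.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            //Step 2: move the generated HFiles into the table's regions
            LoadIncrementalHFiles load = new LoadIncrementalHFiles(conf);
            load.doBulkLoad(output, admin, table, connection.getRegionLocator(tableName));
        }
    }
}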

Tags: Big Data HBase jar Storage
