Table of contents
1. Introduction to HBase BulkLoad
- Foreword
- Why use bulkload to import?
- How bulkload works
2. HBase BulkLoad writes data in batches
- Develop code to generate HFile files
- Make it into a jar package and submit it to the cluster for running
- Observe the output results on HDFS
- Load the HFile files into the HBase table
- Summary
1. Introduction to HBase BulkLoad
1. Foreword
We have already covered HBase's storage mechanism: HBase ultimately stores its data on HDFS. Each HBase table corresponds to a directory on HDFS named after the table (tables without a namespace live under the default directory). Under the table directory there is one directory per region, and inside each region directory every column family has its own subdirectory, which holds the actual data in the form of HFiles.
Path format: /hbase/data/default/<tbl_name>/<region_id>/<column_family>/<hfile_id>
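For example, an HFile belonging to a table named myuser2 with column family f1 would sit under a path like the following (the region and file names here are illustrative hashes, not real ones):

/hbase/data/default/myuser2/8e0fe2455d82a94e6c0e4714c51d36a7/f1/c4b5dd62a9f44e1e8d0e6b1a2f3c4d5e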
2. Why use bulkload to import?
There are many ways to batch-load data into an HBase cluster: writing through the HBase API, importing with the Sqoop tool, running a MapReduce import job, and so on. With all of these, if the amount of data is very large, the import may take a long time or consume a lot of HBase cluster resources (disk IO, HBase handler threads, and so on).
3. How bulkload works
Since HBase ultimately stores its data on HDFS as HFiles, bulkload uses MapReduce to generate the data files directly in HFile format, and the HFiles are then moved into the corresponding regions via the RegionServers.
- The normal HBase write process

- Schematic diagram of the bulkload process

- The benefits of bulkload
- The import process does not occupy Region resources
- Can import massive amounts of data quickly
- Saves memory
2. HBase BulkLoad writes data in batches
Requirement
Using bulkload, convert the data file that we placed on HDFS at /hbase/input/user.txt into HFile format, then load it into the HBase table myuser2.
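The Mapper developed below assumes that /hbase/input/user.txt contains tab-separated lines of rowkey, name and age, and that the target table myuser2 already exists with a column family f1. An illustrative input file and table creation (the row values are made up for this example) might look like this:

0001	zhangsan	18
0002	lisi	25
0003	wangwu	30

# in the HBase shell, pre-create the target table
create 'myuser2', 'f1'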
1. Develop code to generate HFile files
Custom Mapper class
package xsluo.hbase;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

//The last two of the four generic types correspond to the rowkey and the Put respectively
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        //Wrap the rowkey into the output key type
        ImmutableBytesWritable immutableBytesWritable = new ImmutableBytesWritable(split[0].getBytes());
        //Build the Put object
        Put put = new Put(split[0].getBytes());
        put.addColumn("f1".getBytes(), "name".getBytes(), split[1].getBytes());
        put.addColumn("f1".getBytes(), "age".getBytes(), split[2].getBytes());

        context.write(immutableBytesWritable, put);
    }
}
Driver class
package xsluo.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HBaseBulkLoad extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        //Set the ZK quorum
        configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
        int run = ToolRunner.run(configuration, new HBaseBulkLoad(), args);
        System.exit(run);
    }

    @Override
    public int run(String[] strings) throws Exception {
        Configuration conf = super.getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(HBaseBulkLoad.class);

        FileInputFormat.addInputPath(job, new Path("hdfs://node01:8020/hbase/input"));
        job.setMapperClass(BulkLoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        //Get the connection to the cluster
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("myuser2"));
        //Configure the job for an incremental load into the myuser2 table
        HFileOutputFormat2.configureIncrementalLoad(job, table, connection.getRegionLocator(TableName.valueOf("myuser2")));

        //The data is written back to HDFS as HFiles -> so the output format is HFileOutputFormat2
        job.setOutputFormatClass(HFileOutputFormat2.class);
        HFileOutputFormat2.setOutputPath(job, new Path("hdfs://node01:8020/hbase/out_hfile"));

        //Start execution
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }
}
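Note that HFileOutputFormat2.configureIncrementalLoad does most of the heavy lifting here: it reads the current region boundaries of myuser2 through the RegionLocator and configures the job's partitioner and reducer accordingly, so that each reduce task produces HFiles whose key ranges line up with an existing region.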
2. Make a jar package and submit it to the cluster for running
hadoop jar hbase_demo-1.0-SNAPSHOT.jar xsluo.hbase.HBaseBulkLoad
3. Observe the output results on HDFS


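If the job succeeds, the output directory should contain a _SUCCESS marker plus one subdirectory per column family holding the generated HFiles; a listing along these lines is what to expect (the HFile name below is illustrative):

hdfs dfs -ls /hbase/out_hfile
/hbase/out_hfile/_SUCCESS
/hbase/out_hfile/f1

hdfs dfs -ls /hbase/out_hfile/f1
/hbase/out_hfile/f1/c9a4e2b1f0d34a6d8e7b5c3a2f1e0d9c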
4. Load the HFile files into the HBase table
Method 1: Code loading
package xsluo.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class LoadData {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.quorum", "node01,node02,node03");

        //Get the connection to the cluster
        Connection connection = ConnectionFactory.createConnection(configuration);
        //Get the Admin object for the cluster
        Admin admin = connection.getAdmin();
        //Get the table object
        TableName tableName = TableName.valueOf("myuser2");
        Table table = connection.getTable(tableName);
        //Build a LoadIncrementalHFiles instance to load the HFile files
        LoadIncrementalHFiles load = new LoadIncrementalHFiles(configuration);
        load.doBulkLoad(new Path("hdfs://node01:8020/hbase/out_hfile"), admin, table, connection.getRegionLocator(tableName));
    }
}
Method 2: Command-line loading
First add the HBase jars to the Hadoop classpath
export HBASE_HOME=/xsluo/install/hbase-1.2.0-cdh5.14.2/
export HADOOP_HOME=/xsluo/install/hadoop-2.6.0-cdh5.14.2/
export HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`
- Run the load command
yarn jar /xsluo/install/hbase-1.2.0-cdh5.14.2/lib/hbase-server-1.2.0-cdh5.14.2.jar completebulkload /hbase/out_hfile myuser2
After loading, the original HFiles are moved on HDFS into the column family directories of the corresponding regions of the table.
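Whichever method is used, a quick scan in the HBase shell should now return the imported rows (the output will vary with your input data):

hbase shell
scan 'myuser2', {LIMIT => 5}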
5. Summary
For demonstration purposes, this article splits BulkLoad into two separate steps: generating the HFile files and importing them into the HBase cluster. In practice, the two steps can be combined so that the HFiles are generated and imported automatically in one job. If an RpcRetryingCaller exception occurs during execution, check the logs on the corresponding RegionServer node; they record the detailed cause of the exception.