Hive (10): Hive's partition and bucket tables

Contents

0. Links to related articles

1. Partition table

1.1. Partition table basic operations

1.2. Secondary partition

1.3. Dynamic partition adjustment

1.3.1. Parameters for enabling dynamic partitioning

1.3.2. Case Practice

2. Bucket table

2.1. Create bucket table

2.2. Notes on bucket table operations

2.3. Importing data into bucket table by insert

3. Sampling queries

0. Links to related articles

 Summary of Hive Articles 

1. Partition table

A partition table corresponds to a separate folder on the HDFS file system that holds all the data files for that partition. A partition in Hive is a subdirectory that divides a large dataset into smaller ones according to business needs. By selecting only the partitions a query needs, via expressions in the WHERE clause, Hive can skip the rest of the data and greatly improve query efficiency.

1.1. Partition table basic operations

1) Scenario for partitioned tables (managing logs by date, simulated here with department information)

dept_20200401.log
dept_20200402.log
dept_20200403.log

2) Create partition table syntax

create table dept_partition(
    deptno int
    , dname string
    , loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Note: The partition field cannot be a column that already exists in the table; it can be regarded as a pseudo column of the table.
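
Because the partition field behaves like a pseudo column, it can be selected and filtered just like an ordinary column. For example, after loading the data shown in the next step:

hive (default)> select deptno, dname, loc, day from dept_partition where day='20200401';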

3) Load data into partition table

  • Data preparation
dept_20200401.log
10	ACCOUNTING	1700
20	RESEARCH	1800

-- ======================================

dept_20200402.log
30	SALES	1900
40	OPERATIONS	1700

-- ======================================

dept_20200403.log
50	TEST	2000
60	DEV	1900
  • Loading data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');

Note: A partition must be specified when loading data into a partition table.

4) Query partition table data

-- Single partition query:
hive (default)> select * from dept_partition where day='20200401';

-- Multi-partition union query:
hive (default)> select * from dept_partition where day='20200401'
              union
              select * from dept_partition where day='20200402'
              union
              select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or day='20200402' or day='20200403';	

5) Add partitions

-- Create a single partition:
hive (default)> alter table dept_partition add partition(day='20200404');
-- Create multiple partitions at the same time:
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');

6) Delete partitions

-- Delete a single partition:
hive (default)> alter table dept_partition drop partition (day='20200406');
-- Delete multiple partitions at the same time:
hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');

7) See how many partitions are in the partition table

hive> show partitions dept_partition;

8) View partition table structure

hive> desc formatted dept_partition;

# Partition Information          
# col_name              data_type               comment             
day                     string    

1.2. Secondary partition

If a single day produces a large amount of log data, how can it be split up further?

1) Create a secondary partition table

hive (default)> create table dept_partition2(
               deptno int, dname string, loc string
               )
               partitioned by (day string, hour string)
               row format delimited fields terminated by '\t';

2) Load data normally

-- 1.Load data into a secondary partition table:
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');

-- 2.Query partition data
hive (default)> select * from dept_partition2 where day='20200401' and hour='12';

3) Three ways to associate data with the partition table after uploading it directly to a partition directory

Method 1: Repair the metadata after uploading data

-- Upload data:
hive (default)> dfs -mkdir -p  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

-- Query data (the data just uploaded cannot be queried yet):
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

-- Execute Repair Command
hive> msck repair table dept_partition2;

-- Query data again
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

Method 2: Add the partition after uploading data

-- Upload data:
hive (default)> dfs -mkdir -p  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

-- Execute Add Partition
hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');

-- Query Data
hive (default)> select * from dept_partition2 where day='20200401' and hour='14';

Method 3: Create the directory, then load data into the partition

-- Create directory
hive (default)> dfs -mkdir -p  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;

-- Upload data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');

-- Query Data
hive (default)> select * from dept_partition2 where day='20200401' and hour='15';

1.3. Dynamic partition adjustment

In a relational database, when data is inserted into a partitioned table, the database automatically routes each row to the appropriate partition based on the value of the partition field. Hive provides a similar mechanism, dynamic partitioning, but using it requires appropriate configuration.

1.3.1. Parameters for enabling dynamic partitioning

1) Enable dynamic partitioning (default true, enabled)

hive.exec.dynamic.partition=true

2) Set non-strict mode (the dynamic partitioning mode; the default strict requires at least one partition field to be specified as a static partition, while nonstrict allows all partition fields to be dynamic)

hive.exec.dynamic.partition.mode=nonstrict

3) Maximum number of dynamic partitions that can be created across all MR nodes in total. Default 1000

hive.exec.max.dynamic.partitions=1000

4) Maximum number of dynamic partitions that can be created on each node executing MR. This parameter must be set according to the actual data. For example, if the source data contains one year of data, the day field has 365 distinct values, so the parameter must be set greater than 365; with the default of 100, the job will fail.

hive.exec.max.dynamic.partitions.pernode=100

5) Maximum number of HDFS files that can be created in the whole MR job. Default 100000

hive.exec.max.created.files=100000

6) Whether to throw an exception when an empty partition is generated. Usually does not need to be set. Default false

hive.error.on.empty.partition=false
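
Putting these together, a typical session-level setup before running a dynamic-partition insert might look like the following (the pernode value of 366 is illustrative, matching the one-year example above):

hive (default)> set hive.exec.dynamic.partition=true;
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> set hive.exec.max.dynamic.partitions=1000;
hive (default)> set hive.exec.max.dynamic.partitions.pernode=366;
hive (default)> set hive.exec.max.created.files=100000;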

1.3.2. Case Practice

Requirement: Insert data from the dept table into the corresponding partitions of the target table dept_partition_dy, partitioned by region (the loc field).

1) Create target partition table

hive (default)> create table dept_partition_dy(id int, name string)
                partitioned by (loc int)
                row format delimited fields terminated by '\t';

2) Setting up dynamic partitions

hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;

3) View partitions of the target partition table

hive (default)> show partitions dept_partition_dy;

Think: how does the target partition table know which query column maps to the partition field?
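
Answer: Hive matches dynamic partition values by position, not by name: the trailing column(s) of the SELECT supply the partition field(s). A minimal sketch (the alias region is purely illustrative and does not affect the mapping):

hive (default)> insert into table dept_partition_dy partition(loc)
                select deptno, dname, loc as region from dept;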

2. Bucket table

Partitioning provides a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly. For a table or partition, Hive can further organize the data into buckets, a more fine-grained division of the data range. Bucketing is another technique for breaking a dataset into more manageable parts. Partitioning divides the data storage path (directories); bucketing divides the data files.

2.1. Create bucket table

1) Data preparation

1001	ss1
1002	ss2
1003	ss3
1004	ss4
1005	ss5
1006	ss6
1007	ss7
1008	ss8
1009	ss9
1010	ss10
1011	ss11
1012	ss12
1013	ss13
1014	ss14
1015	ss15
1016	ss16

2) Create bucket table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

3) View table structure

hive (default)> desc formatted stu_buck;
Num Buckets:            4     

4) Load data into the bucket table

hive (default)> load data inpath  '/student.txt' into table stu_buck;

5) Check whether the created bucket table has been divided into four buckets, as shown below
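
One way to check is to list the table's directory on HDFS (assuming the default warehouse location for the default database):

hive (default)> dfs -ls /user/hive/warehouse/stu_buck;

Four bucket files (e.g. 000000_0 through 000003_0) should appear under the directory.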

6) Query bucket-wise data

hive(default)> select * from stu_buck;

7) Bucketing rules:

Hive computes a hash of the bucketing column's value and takes it modulo the number of buckets to determine which bucket a record is stored in.
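
A rough way to visualize the rule with Hive's built-in hash() and pmod() functions (for an int column, hash(id) is simply id, so record 1016 lands in bucket 0, 1013 in bucket 1, and so on):

hive (default)> select id, name, pmod(hash(id), 4) as bucket_idx from stu_buck;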

2.2. Notes on bucket table operations

  • Set the number of reducers to -1 to let the job decide how many to use, or set it to a value greater than or equal to the number of buckets in the bucket table
  • Load data into the bucket table from HDFS rather than a local path, to avoid errors when the local file is missing on the executing node
  • Do not use local mode

2.3. Importing data into bucket table by insert

hive(default)>insert into table stu_buck select * from student_insert;
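
Here student_insert is assumed to be an ordinary (non-bucketed) staging table with the same schema, created and loaded along these lines (the file path is illustrative):

hive (default)> create table student_insert(id int, name string)
                row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/hive/datas/student.txt' into table student_insert;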

3. Sampling queries

For very large datasets, users sometimes need a representative sample of the results rather than the whole result set. Hive supports this with table sampling.

Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname])

Query the data in table stu_buck:

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
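
Here x is the bucket to start from (numbered from 1) and y determines the fraction sampled: the query returns (number of buckets)/y buckets, starting at bucket x and stepping by y. For stu_buck with 4 buckets, the following pulls buckets 1 and 3, i.e. roughly half the data:

hive (default)> select * from stu_buck tablesample(bucket 1 out of 2 on id);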

Note: The value of x must be less than or equal to the value of y; otherwise the query fails with:

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
