Catalog
0. Links to related articles
1. Partition table
1.1. Partition table basic operations
1.2. Secondary partition
1.3. Dynamic partition adjustment
1.3.1. Turn on dynamic partition parameter settings
1.3.2. Case practice
2. Bucket table
2.1. Create bucket table
2.2. Notes for bucket table operation
2.3. Importing data into bucket table by insert
3. Sampling queries
0. Links to related articles
1. Partition table
1. Partition table
A partition table is essentially a set of separate folders on the HDFS file system, one per partition, each containing all of that partition's data files. Partitions in Hive are subdirectories that divide a large dataset into smaller datasets according to business needs. Selecting only the required partitions through expressions in the WHERE clause can greatly improve query efficiency.
1.1. Partition table basic operations
1) Introducing partition tables (e.g., managing logs by date, simulated here with department information)
dept_20200401.log dept_20200402.log dept_20200403.log
2) Create partition table syntax
create table dept_partition(
    deptno int,
    dname string,
    loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Note: A partition field cannot be a column that already exists in the table. Partition fields can be thought of as pseudo columns of the table.
3) Load data into partition table
- Data preparation
dept_20200401.log
10	ACCOUNTING	1700
20	RESEARCH	1800
-- ======================================
dept_20200402.log
30	SALES	1900
40	OPERATIONS	1700
-- ======================================
dept_20200403.log
50	TEST	2000
60	DEV	1900
- Loading data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');
Note: The partition must be specified when loading data into a partition table.
4) Query partition table data
-- Single-partition query
hive (default)> select * from dept_partition where day='20200401';
-- Multi-partition union query
hive (default)> select * from dept_partition where day='20200401'
                union
                select * from dept_partition where day='20200402'
                union
                select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or day='20200402' or day='20200403';
5) Increase partitions
-- Create a single partition
hive (default)> alter table dept_partition add partition(day='20200404');
-- Create multiple partitions at the same time
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');
6) Delete partitions
-- Delete a single partition
hive (default)> alter table dept_partition drop partition (day='20200406');
-- Delete multiple partitions at the same time
hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
7) See how many partitions are in the partition table
hive> show partitions dept_partition;
8) View partition table structure
hive> desc formatted dept_partition;
# Partition Information
# col_name    data_type    comment
day           string
1.2. Secondary partition
If you have a large amount of log data in a day, how do you split it up again?
1) Create a secondary partition table
hive (default)> create table dept_partition2(
    deptno int,
    dname string,
    loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
2) Normal loading data
-- 1. Load data into the secondary partition table
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');
-- 2. Query partition data
hive (default)> select * from dept_partition2 where day='20200401' and hour='12';
3) Three ways to associate a partition table with data uploaded directly to a partition directory
Method 1: Repair after uploading data
-- Upload data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
-- Query data (the newly uploaded data cannot be queried yet)
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
-- Execute the repair command
hive> msck repair table dept_partition2;
-- Query data again
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
Method 2: Add the partition after uploading data
-- Upload data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
-- Execute add partition
hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');
-- Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='14';
Method 3: load data into the partition after creating the folder
-- Create directory
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
-- Upload data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
-- Query data
hive (default)> select * from dept_partition2 where day='20200401' and hour='15';
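All three methods rely on the same convention: a partition's key/value pairs are encoded directly in its HDFS directory name (`day=20200401/hour=13`). The sketch below (illustrative Python, not Hive source code) shows how a partition spec can be recovered from such a path; `msck repair table` works on the same principle, scanning for `key=value` directories that are missing from the metastore.

```python
# Sketch: recover a partition spec from an HDFS directory path.
# The helper name parse_partition_spec is hypothetical, for illustration only.

def parse_partition_spec(path: str, table_root: str) -> dict:
    """Extract partition key/value pairs from the directory levels under table_root."""
    relative = path[len(table_root):].strip("/")
    spec = {}
    for level in relative.split("/"):
        key, _, value = level.partition("=")
        spec[key] = value
    return spec

root = "/user/hive/warehouse/mydb.db/dept_partition2"
spec = parse_partition_spec(root + "/day=20200401/hour=13", root)
print(spec)  # {'day': '20200401', 'hour': '13'}
```

This is why uploading files with `dfs -put` alone is not enough: the directory exists and follows the naming convention, but the metastore has no record of it until `msck repair` or `alter table ... add partition` registers it.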
1.3. Dynamic partition adjustment
In relational databases, when inserting data into a partitioned table, the database automatically routes each row to the appropriate partition based on the value of the partition field. Hive provides a similar mechanism, dynamic partitioning, except that using it requires the appropriate configuration.
1.3.1. Turn on dynamic partition parameter settings
1) Turn on dynamic partitioning (default true, on)
hive.exec.dynamic.partition=true
2) Set to non-strict mode (dynamic partitioning mode; the default is strict, meaning at least one partition field must be specified as a static partition, while nonstrict allows all partition fields to be dynamic)
hive.exec.dynamic.partition.mode=nonstrict
3) The maximum number of dynamic partitions that can be created on all MR nodes. Default 1000
hive.exec.max.dynamic.partitions=1000
4) Maximum number of dynamic partitions can be created at each node where MR is performed. This parameter needs to be set based on the actual data. For example, if the source data contains one year's data, that is, the day field has 365 values, then the parameter needs to be set to be greater than 365, and if the default value of 100 is used, an error will be reported.
hive.exec.max.dynamic.partitions.pernode=100
5) How many HDFS files can be created in the whole MR Job. Default 100000
hive.exec.max.created.files=100000
6) Whether an exception is thrown when an empty partition is generated. Settings are generally not required. Default false
hive.error.on.empty.partition=false
1.3.2. Case Practice
Requirement: Insert data from the dept table into the corresponding partitions of the target table dept_partition_dy, partitioned by region (the loc field).
1) Create target partition table
hive (default)> create table dept_partition_dy(
    id int,
    name string
)
partitioned by (loc int)
row format delimited fields terminated by '\t';
2) Setting up dynamic partitions
hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
3) View partitions of the target partition table
hive (default)> show partitions dept_partition_dy;
Think: How does the target partition table match to the partition field?
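The answer is positional, not by name: Hive takes the last column(s) of the SELECT as the dynamic partition value(s), regardless of column names. The sketch below (illustrative Python, not Hive internals) mimics how rows are routed by their last selected column:

```python
# Sketch: dynamic-partition routing by the LAST select column.
# The sample rows mirror the dept table from the text; values are illustrative.
from collections import defaultdict

rows = [  # (deptno, dname, loc) as selected from dept
    (10, "ACCOUNTING", 1700),
    (20, "RESEARCH", 1800),
    (40, "OPERATIONS", 1700),
]

partitions = defaultdict(list)
for *data, loc in rows:          # last select column -> partition value
    partitions[loc].append(tuple(data))

print(sorted(partitions))        # [1700, 1800]
print(partitions[1700])          # [(10, 'ACCOUNTING'), (40, 'OPERATIONS')]
```

This is why `select deptno, dname, loc` works even though the target columns are named `id` and `name`: only the order matters, and `loc`, being last, becomes the partition value.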
2. Bucket table
Partitions provide a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly. For a table or partition, Hive can further organize the data into buckets, a more fine-grained division of the data. Bucketing is another technique for breaking data into more manageable parts. Partitions divide the data storage path; buckets divide the data files.
2.1. Create bucket table
1) Data preparation
1001	ss1
1002	ss2
1003	ss3
1004	ss4
1005	ss5
1006	ss6
1007	ss7
1008	ss8
1009	ss9
1010	ss10
1011	ss11
1012	ss12
1013	ss13
1014	ss14
1015	ss15
1016	ss16
2) Create bucket table
create table stu_buck(
    id int,
    name string
)
clustered by (id)
into 4 buckets
row format delimited fields terminated by '\t';
3) View table structure
hive (default)> desc formatted stu_buck;
Num Buckets: 4
4) How to load data into bucket table
hive (default)> load data inpath '/student.txt' into table stu_buck;
5) Check on HDFS whether the created bucket table is divided into four buckets
6) Query bucket-wise data
hive(default)> select * from stu_buck;
7) Bucketing rules:
Hive determines which bucket a record goes into by hashing the value of the bucketing column and taking the remainder of division by the number of buckets.
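The rule above can be sketched in a few lines of Python. One assumption worth noting: for integer columns, Hive's default hash is the value itself (string columns use a Java-style hash), so with the sample ids and 4 buckets the bucket is simply `id % 4`.

```python
# Sketch of Hive's bucketing rule: bucket = hash(column) mod num_buckets.
# Assumption: for an int column, Hive's hash of the value is the value itself.

def bucket_of(key: int, num_buckets: int) -> int:
    return key % num_buckets  # hash(int) == int for Hive's default hashing

ids = range(1001, 1017)  # the sample data 1001..1016 from the text
buckets = {b: [i for i in ids if bucket_of(i, 4) == b] for b in range(4)}
print(buckets[0])  # [1004, 1008, 1012, 1016]
print(buckets[1])  # [1001, 1005, 1009, 1013]
```

Each of the 4 output files on HDFS then holds exactly one of these groups, which is what step 5) above verifies.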
2.2. Notes for bucket table operation:
- Set the number of reducers to -1 and let the Job decide how many to use, or set it greater than or equal to the number of buckets in the bucket table
- Load data into the bucket table from HDFS, to avoid errors when a local file cannot be found
- Do not use local mode
2.3. Importing data into bucket table by insert
hive(default)>insert into table stu_buck select * from student_insert;
3. Sampling queries
For very large datasets, sometimes the user needs to use a representative query result instead of the whole result. Hive can do this by sampling tables.
Syntax: TABLESAMPLE(BUCKET x OUT OF y)
Query data in table stu_buck:
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note: The value of x must be less than or equal to the value of y; otherwise, the following error occurs:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
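Which buckets does `TABLESAMPLE(BUCKET x OUT OF y)` actually read? For a table with z buckets it selects bucket x, then every y-th bucket after it, returning roughly z/y of the data (this sketch assumes y evenly divides the bucket count):

```python
# Sketch: bucket numbers (1-based) read by TABLESAMPLE(BUCKET x OUT OF y)
# from a table clustered into z buckets.

def sampled_buckets(x: int, y: int, z: int) -> list:
    """Start at bucket x and take every y-th bucket up to z."""
    return list(range(x, z + 1, y))

print(sampled_buckets(1, 4, 4))  # [1]     -> one quarter of a 4-bucket table
print(sampled_buckets(1, 2, 4))  # [1, 3]  -> half the table
print(sampled_buckets(2, 2, 4))  # [2, 4]  -> the other half
```

So `tablesample(bucket 1 out of 4 on id)` against the 4-bucket stu_buck table reads exactly one bucket, and the x <= y restriction in the error above follows naturally: a starting bucket beyond the stride makes no sense.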
Note: Links to other Hive related series articles are entered here - > Summary of Hive Articles