1. Background
During work, it is necessary to initialize the data of a table. The initialized data is in txt format and stored locally. The file is several G. First, test it locally. There are two methods for uploading data to hdfs, one is load and the other is put. Try both methods.
The test data used is /tmp/odp/hive/paoshu/t3, the content is as follows, separated by commas:
'test', 87.6, 100
'Long amount','',23
Second, the target table is a non-partitioned table
build table
create table tmp.tmp_xyy (sys_name string, its_n_all_score_last_m string, its_n_all_downtime_score_m string )row format delimited fields terminated by ',' stored as textfile;
This delimiter is very important and cannot be written incorrectly. It must be consistent with the actual delimiter of the data, otherwise all the data in the final table may be squeezed into one field.
After the table is built, check its hdfs path:
show partition tmp.tmp_xyy;
its path is
hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy
Use the put command to upload the data file to the specified directory
hdfs dfs -put /tmp/odp/hive/paoshu/t3 hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy
view hdfs file
hdfs dfs -ls -R hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy
The partition file already exists
view data
select * from tmp.tmp_xyy t;
It can be seen that the data already exists
t.sys_name | t.its_n_all_score_last_m | t.its_n_all_downtime_score_m |
---|---|---|
'test' | 87.6 | 100 |
'long amount' | '' | 23 |
It can be analyzed that, in fact, it is quite simple for the target table to be a non-partitioned table. The focus is on the analysis of uploading to the partitioned table.
Third, the target table is a partition table
Create a partition table with batch_date as the partition field
create table tmp.tmp_xyy1 (sys_name string, its_n_all_score_last_m string, its_n_all_downtime_score_m string )partitioned by (batch_date string) row format delimited fields terminated by ',' stored as textfile;
Its hdfs path is
hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1
- Upload with load and test with partition batch_date=20230101
load data local inpath '/tmp/odp/hive/paoshu/t3' into table tmp.tmp_xyy1 partition(batch_date=20230101)
Unexpectedly, an error was reported,
Unable to recognize the path, I checked the information to know that this local in hive refers to the machine where the hiveserver service is located.
Instead of the machine where the hivecli or beeline client is located (the production environment is mostly hiveserver and hivecli are not on the same machine)
, and in most cases, we do not have permission to use hiveserver, so we can only put the data files in hdfs first, so that the files can be identified.
hdfs dfs -put /tmp/odp/hive/paoshu/t3 hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1
Then load again, unexpectedly reported an error again
It turns out that the data file has been placed in hdfs at this time, not local, so the local should be removed.
load data inpath '/tmp/odp/hive/paoshu/t3' into table tmp.tmp_xyy1 partition(batch_date=20230101)
It should be noted that the tenant of put ting the file to the target address must be consistent with the tenant of load ing
, otherwise a permission problem will be reported.
At this time, go to check the partition again.
show partitions tmp.tmp_xyy1
partition |
---|
batch_date=20230101 |
It worked.
2. Use put to upload, and use the same data file for testing, but the test partition we used this time is batch_date=20230102, which is different from load
Yes, put needs to create a target address first.
hdfs dfs -mkdir hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/20230102
Then put the file to the target address
hdfs dfs -put /tmp/odp/hive/paoshu/t3 hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/20230102
Different from load, after uploading to the partition, further metadata needs to be added
alter table tmp.tmp_xyy1 add if not exists partition (batch_date=20230102)
Otherwise, no matter what, the data of partition 20230102 cannot be seen.
At this point we go to check the partition and data of tmp.tmp_xyy1, the partition already exists
show partitions tmp.tmp_xyy1
partition |
---|
batch_date=20230101 |
batch_date=20230102 |
You can view the data, partition 20230102 has no data,
select * from tmp.tmp_xyy1 t;
t.sys_name | t.its_n_all_score_last_m | t.its_n_all_downtime_score_m | t.batch_date |
---|---|---|---|
'test' | 87.6 | 100 | 20230101 |
'long amount' | '' | 23 | 20230101 |
The metadata has indeed been added, but the data has not been read, which is strange, hdfs manages data files, check hdfs,
hdfs dfs -ls -R hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1
It turned out that when the hdfs path was created manually, batch_date=20230102 was missing, so delete the partition and rebuild it
hdfs dfs -rm hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/20230102 hdfs dfs -rm hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/batch_date=20230102
put the data to the new hdfs directory again
hdfs dfs -put /tmp/odp/hive/paoshu/t3 hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/batch_date=20230102
At this point we look at the data
t.sys_name | t.its_n_all_score_last_m | t.its_n_all_downtime_score_m | t.batch_date |
---|---|---|---|
'test' | 87.6 | 100 | 20230101 |
'long amount' | '' | 23 | 20230101 |
'test' | 87.6 | 100 | 20230102 |
'long amount' | '' | 23 | 20230102 |
You're done.
Summarize
1. To upload a local file to the partition table, you can use load and put. Load is relatively simple. It is recommended to put the file in hdfs first, and then upload it to the specified partition directory.
2. When manually creating partitions and put ting data files to the target directory, you must use the same tenant as the table to ensure that no permission errors are reported.
3. Manually create partitions, and the partition directory cannot be wrong. For example, batch_date= is missing here, so even if the partition is created successfully, the partition data cannot be seen.