Some pitfalls encountered when uploading local data to HDFS

1. Background

At work I needed to initialize a table with data. The initialization data is in txt format and stored locally, and the file is several GB. I first tested the process locally. There are two ways to upload data to HDFS: load and put. Both are tried below.
The test data file is /tmp/odp/hive/paoshu/t3. Its contents, separated by commas, are as follows:
'test', 87.6, 100
'Long amount','',23

2. The target table is a non-partitioned table

Build the table:

create table tmp.tmp_xyy
(sys_name string,
its_n_all_score_last_m string,
its_n_all_downtime_score_m string
)row format delimited fields terminated by ',' stored as textfile;

The delimiter here is critical and must match the actual delimiter used in the data file; otherwise all of a row's data may end up squeezed into a single field.
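Before building the table, it is worth eyeballing the raw file to confirm the delimiter (a trivial check, assuming the file path above):

# print the first two lines and confirm the fields are comma-separated
head -n 2 /tmp/odp/hive/paoshu/t3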
After the table is built, check its HDFS path (shown in the Location field of the output):

describe formatted tmp.tmp_xyy;

Its path is:

hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy

Use the put command to upload the data file to that directory:

hdfs dfs -put /tmp/odp/hive/paoshu/t3  hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy
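One caveat worth noting: put refuses to copy if a file with the same name already exists at the destination. If you are re-running the upload, the -f flag overwrites it:

# overwrite the existing file on re-upload
hdfs dfs -put -f /tmp/odp/hive/paoshu/t3 hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy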

View the HDFS file:

hdfs dfs -ls -R hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy

The data file is now in place under the table directory.
View the data:

select * from tmp.tmp_xyy t;

The data is now visible:

t.sys_name       t.its_n_all_score_last_m    t.its_n_all_downtime_score_m
'test'           87.6                        100
'long amount'    ''                          23
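To confirm nothing was lost along the way, you can compare the source file's line count with the table's row count (a minimal sanity check using the names above):

# number of lines in the local source file
wc -l /tmp/odp/hive/paoshu/t3

-- row count in the table; the two should match
select count(*) from tmp.tmp_xyy;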

As this shows, uploading to a non-partitioned table is quite simple. The real focus is uploading to a partitioned table, analyzed next.

3. The target table is a partitioned table

Create a partitioned table with batch_date as the partition column:

create table tmp.tmp_xyy1
(sys_name string,
its_n_all_score_last_m string,
its_n_all_downtime_score_m string
)partitioned by (batch_date string)
row format delimited fields terminated by ',' stored as textfile;

Its HDFS path is:

hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1

1. Upload with load, testing with partition batch_date=20230101:
load data local inpath '/tmp/odp/hive/paoshu/t3' into table tmp.tmp_xyy1 partition(batch_date=20230101);

Unexpectedly, this reported an error: the path could not be recognized. After digging into it, I learned that LOCAL here refers to the machine where the HiveServer service runs, not the machine where the Hive CLI or Beeline client runs (in production, HiveServer and the client are usually not on the same machine). And in most cases we have no access to the HiveServer host, so the only option is to put the data file into HDFS first so that it can be found. Here I stage it under the same path on HDFS as on the local disk:

hdfs dfs -mkdir -p hdfs://hacluster/tmp/odp/hive/paoshu
hdfs dfs -put /tmp/odp/hive/paoshu/t3 hdfs://hacluster/tmp/odp/hive/paoshu/

Then I ran load again, and unexpectedly it reported an error again.

The reason: the data file is now in HDFS, not on the local filesystem, so the LOCAL keyword must be removed.

load data inpath '/tmp/odp/hive/paoshu/t3' into table tmp.tmp_xyy1 partition(batch_date=20230101);
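One detail worth knowing: load data inpath moves the file into the partition directory rather than copying it, so the staged copy is gone after the load. A quick check (using the staging path from above):

# the staged file was moved into the partition directory by the load
hdfs dfs -ls hdfs://hacluster/tmp/odp/hive/paoshu/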

It should be noted that the tenant (user) that puts the file to the target path must be the same as the tenant that runs the load, otherwise a permission error will be reported.
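If in doubt, check who owns the staged file before running the load (a minimal sketch; the user and group names are hypothetical, and chown requires owner or superuser privileges):

# check the current owner and group of the staged file
hdfs dfs -ls hdfs://hacluster/tmp/odp/hive/paoshu/t3
# hand it over to the loading tenant if necessary (hypothetical user:group)
hdfs dfs -chown loaduser:hive hdfs://hacluster/tmp/odp/hive/paoshu/t3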
Now check the partitions again:

show partitions tmp.tmp_xyy1;
partition
batch_date=20230101

It worked.

2. Upload with put, using the same data file for testing, but this time with partition batch_date=20230102, different from the load test.
Unlike load, put requires creating the target directory first:

hdfs dfs -mkdir hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/20230102

Then put the file to the target directory:

hdfs dfs -put /tmp/odp/hive/paoshu/t3 hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/20230102

Unlike load, after uploading the file to the partition directory you must also add the partition metadata:

alter table tmp.tmp_xyy1 add if not exists partition (batch_date=20230102);

Otherwise, no matter what you do, the data in partition 20230102 will not be visible.
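As an aside, when partition directories follow Hive's key=value naming convention (batch_date=20230102 here), you can have Hive scan the table directory and register every missing partition in one go instead of adding them one by one:

-- discover partition directories on HDFS and add the missing metadata
msck repair table tmp.tmp_xyy1;

Note that this only works when the directory name carries the batch_date= prefix, which is exactly the pitfall that follows.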
Now check the partitions and data of tmp.tmp_xyy1. The partition exists:

show partitions tmp.tmp_xyy1;
partition
batch_date=20230101
batch_date=20230102

Querying the data, however, partition 20230102 shows no rows:

select * from tmp.tmp_xyy1 t;
t.sys_name       t.its_n_all_score_last_m    t.its_n_all_downtime_score_m    t.batch_date
'test'           87.6                        100                             20230101
'long amount'    ''                          23                              20230101

The metadata has indeed been added, but the data is not being read, which is strange. Since HDFS manages the data files, check HDFS:

hdfs dfs -ls -R hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1


It turned out that when the HDFS directory was created manually, the batch_date= prefix was missing from its name. So delete the wrongly named directory (removing a directory requires -rm -r) and create one with the correct name (mkdir -p is a no-op if Hive already created it when the partition was added):

hdfs dfs -rm -r hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/20230102
hdfs dfs -mkdir -p hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/batch_date=20230102

Then put the data file into the correctly named directory:

hdfs dfs -put /tmp/odp/hive/paoshu/t3  hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/batch_date=20230102

Now look at the data:

select * from tmp.tmp_xyy1 t;
t.sys_name       t.its_n_all_score_last_m    t.its_n_all_downtime_score_m    t.batch_date
'test'           87.6                        100                             20230101
'long amount'    ''                          23                              20230101
'test'           87.6                        100                             20230102
'long amount'    ''                          23                              20230102

You're done.

4. Summary

1. To upload a local file into a partitioned table, you can use either load or put. load is relatively simple; the recommended flow is to put the file into HDFS first and then load it into the target partition.
2. When manually creating partition directories and putting data files into them, use the same tenant as the table owner so that no permission errors are reported.
3. When creating a partition directory manually, the directory name must be exact. Here the batch_date= prefix was missing, so even though the partition was registered, its data could not be seen.
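Putting it all together, a correct put-based upload to a new partition looks like this (a sketch reusing the table and file from above, with a new example partition 20230103):

# create the partition directory with the key=value name Hive expects
hdfs dfs -mkdir -p hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/batch_date=20230103
# upload the data file into it, as the same tenant that owns the table
hdfs dfs -put /tmp/odp/hive/paoshu/t3 hdfs://hacluster/user/hive/warehouse/tmp.db/tmp_xyy1/batch_date=20230103

-- register the partition in the metastore (msck repair table tmp.tmp_xyy1; also works)
alter table tmp.tmp_xyy1 add if not exists partition (batch_date=20230103);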
