Hudi + Spark3 introduction lesson 1

Apache Hudi is a next-generation streaming data lake platform. It brings core data warehouse and database functionality directly to the data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion, data clustering and compaction optimizations, and concurrency control, while keeping the data in open source file formats.

Compiling Hudi 0.10.1 from source

  • Maven 3.5.4 and Spark 3.1.1, with Maven configured to use the Aliyun mirror
  • In pom.xml, change the Spark version to 3.1.1 (it was originally 3.1.2). Minor versions differ very little, so pick whichever matches your own environment.
  • Compile command: mvn clean package -DskipTests -Dscala-2.12 -Dspark3
  • The compiled artifact is in the hudi-spark-bundle directory under packaging
  • Spark bundle jar name: hudi-spark3.1.1-bundle_2.12-0.10.1.jar, about 38 MB in size
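
The build steps above could be scripted roughly as follows. This is only a sketch, not the author's script: it assumes the Spark 3 version is held in a spark3.version property in the top-level pom.xml (check your own checkout), and the helper names set_spark3_version and build_hudi_spark3 are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the Spark 3 version property in a pom file.
# Assumption: the property is named spark3.version; verify this in your pom.xml.
# $1 = path to pom.xml, $2 = Spark version to set
set_spark3_version() {
  # Write to a temp file and move it back, which works with both GNU and BSD sed.
  sed "s|<spark3.version>[^<]*</spark3.version>|<spark3.version>$2</spark3.version>|" \
    "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Build the Spark 3 / Scala 2.12 bundle, skipping tests.
build_hudi_spark3() {
  mvn clean package -DskipTests -Dscala-2.12 -Dspark3
}

# Usage, run from the root of the Hudi 0.10.1 source tree:
#   set_spark3_version pom.xml 3.1.1
#   build_hudi_spark3
#   ls packaging/hudi-spark-bundle/target/*.jar
```

Keeping the version edit in a function makes it easy to re-run the build against a different Spark 3.1.x minor version.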

The previous release, Hudi 0.9.0, has obvious problems when used with Spark 3.1, although it works with Spark 3.0.3. This is also mentioned in Hudi's release notes.

  • hudi-spark3.1.2-bundle_2.12-0.10.1.jar
  • hudi-spark3.0.3-bundle_2.12-0.10.1.jar

These two prebuilt packages can be downloaded from the Maven Central repository. (The page is hard to find; Hudi should tidy up its repository categories.) Posting them here:

# Version 3.1.2

# Version 3.0.3

With the prebuilt packages above, you can skip compiling Hudi yourself.
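
As a rough aid for locating the prebuilt jars, the download URL can be derived from the jar name. The sketch below assumes the artifacts are published under the groupId org.apache.hudi and that Maven Central's standard repo1.maven.org directory layout applies; the helper name hudi_bundle_url is hypothetical, and the printed URLs should be checked before relying on them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Maven Central URL for a Hudi Spark bundle jar.
# Assumptions: groupId org.apache.hudi and the standard repository layout
# (group path / artifactId / version / artifactId-version.jar).
# $1 = Spark version, $2 = Scala binary version, $3 = Hudi version
hudi_bundle_url() {
  artifact="hudi-spark$1-bundle_$2"
  echo "https://repo1.maven.org/maven2/org/apache/hudi/${artifact}/$3/${artifact}-$3.jar"
}

# Print candidate URLs for the two bundles named above.
hudi_bundle_url 3.1.2 2.12 0.10.1
hudi_bundle_url 3.0.3 2.12 0.10.1
```

A printed URL can then be passed to a downloader such as curl -O or wget.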

Support matrix published on the official website:

Spark 3 Support Matrix

    Hudi               Supported Spark 3 versions
    0.10.0             3.1.x (default build), 3.0.x
    0.7.0 - 0.9.0      3.0.x
    0.6.0 and prior    Not supported

You can see that the default build of Hudi 0.10 targets Spark 3.1, and that it can also be built for Spark 3.0.

Testing procedure

Environment information

  1. Spark 3.1.1 installed
  2. Hive 3.1 installed
  3. Operating system: CentOS 7.4
  4. Java 8

Spark3 quick test

  1. Copy the Hudi jar into the jars directory under the Spark installation directory, for example:

    cp hudi-spark3.1.1-bundle_2.12-0.10.1.jar /usr/hdp/
  2. Start the spark-sql client and check that it comes up normally:

    • Because the Hudi Spark jar is already on Spark's jar loading path, it does not need to be specified explicitly.

    • If a permission error is reported, switch to a user with Hive access. The commands here were run as the hive user.

    ./bin/spark-sql --master yarn --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  3. The > prompt appears, waiting for command input. (At this point I had not copied the avro package, and no errors were reported.)

  4. Create a non partitioned hudi table:

    By default this creates a COW (copy-on-write) table. The default primary key is uuid, and no preCombine field is set.

    create table hudi_table0 (
      uuid int,
      name string,
      price double
    ) using hudi;
  5. View the table with show tables. The output includes a row with the table name: default hudi_table0 false

  6. Insert two rows into the table:

    insert into hudi_table0 select 1, 'my name is kiki', 20;
    insert into hudi_table0 select 2, 'qiqi', 16;
  7. Query the data just inserted:

    select * from hudi_table0;
    Time taken: 0.361 seconds, Fetched 2 row(s)
  8. A small first trial, and it works.
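
The quick test above can also be replayed non-interactively. The sketch below writes the statements to a SQL file and defines a wrapper that would pass it to spark-sql with the -f option, using the same settings as in step 2. It is an illustration under assumptions, not the author's procedure: the table name hudi_quick_test and the helper names are made up, and the tblproperties clause (type and primaryKey) reflects my understanding of the Hudi 0.10 Spark SQL syntax for stating the defaults explicitly, so check it against the Hudi documentation for your version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the quick-test statements to the file given as $1.
write_quick_test_sql() {
  cat > "$1" <<'EOF'
-- Copy-on-write table with the primary key stated explicitly
-- (assumed Hudi 0.10 syntax; hudi_quick_test is a made-up table name).
create table if not exists hudi_quick_test (
  uuid int,
  name string,
  price double
) using hudi
tblproperties (type = 'cow', primaryKey = 'uuid');

insert into hudi_quick_test select 1, 'my name is kiki', 20;
insert into hudi_quick_test select 2, 'qiqi', 16;
select * from hudi_quick_test;
EOF
}

# Hypothetical helper: run a SQL file with the same settings as step 2.
# Assumes spark-sql is on PATH and the Hudi bundle is already in Spark's jars directory.
run_quick_test() {
  spark-sql --master yarn \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    -f "$1"
}

write_quick_test_sql quick_test.sql
# On a host with Spark and the Hudi bundle installed:
#   run_quick_test quick_test.sql
```

Keeping the statements in a file makes the test repeatable after a rebuild or a jar upgrade.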

Tags: Big Data Java hive

Posted by evaoparah on Wed, 01 Jun 2022 08:24:55 +0530