Table of contents
One, Concept
Two, Common points
Three, the difference
Four, mutual conversion operation
One, Concept
RDD, DataFrame, and Dataset are all resilient distributed datasets on the Spark platform and make it convenient to process very large data. RDD, as the core data abstraction of Spark, is indispensable to the framework. In SparkSQL, Spark provides us with two further abstractions, namely DataFrame and DataSet.
The following describes the concepts of the three:
1. RDD: short for Resilient Distributed Dataset, the most basic data abstraction in Spark. Its defining characteristic is that an RDD contains only the data itself and carries no structural information.
2. DataFrame: also a distributed data container. In addition to the data itself, it records the structural information of the data, namely the schema. The schema lets Spark know which columns the data set contains and what the name and type of each column are.
3. DataSet: Spark's top-level data abstraction. It includes not only the data itself and the schema, but also the type of the data set, i.e. each record really is a JVM object. To achieve this you first define a case class; the data is then shaped into instances of that case class, and each column becomes a field of the case class.
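To make the three concepts concrete, here is a minimal sketch that builds an RDD, a DataFrame, and a Dataset from the same data. The SparkSession name spark and the case class Coltest are illustrative assumptions (Coltest matches the case class used later in this article):

import org.apache.spark.sql.SparkSession

case class Coltest(col1: String, col2: Int) extends Serializable

val spark = SparkSession.builder().appName("rdd-df-ds").master("local[*]").getOrCreate()
import spark.implicits._

// RDD: only the data, no schema
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// DataFrame: data plus schema (column names and types); each row is a Row
val df = rdd.toDF("col1", "col2")
df.printSchema()

// Dataset: data plus schema plus a concrete JVM type for each record
val ds = rdd.map { case (c1, c2) => Coltest(c1, c2) }.toDS()
ds.show()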
Two, Common points
1. RDD, DataFrame, and DataSet are all resilient distributed data sets on the Spark platform, which make it convenient to process very large data;
2. All three are lazily evaluated. Creation and transformations such as the map method are not executed immediately; only when an action such as foreach is encountered do the three start traversing the data (see the sketch after this list);
3. The three share many common functions, such as filter, sorting, etc.;
4. Both DataFrame and DataSet operations require this import: import spark.implicits._
5. All three automatically cache computations according to Spark's memory status, so even when the amount of data is large there is no need to worry about memory overflow.
6. All three have the concept of partitions, for example:
var predata = data.repartition(24).mapPartitions { PartLine =>
  PartLine.map { line =>
    println("Transform operation")
  }
}
In this way, operating on each partition is like operating on an array: not only is the amount of data handled at a time relatively small, but the results computed inside the map can also be taken out easily. If you use map directly, updates made inside the map to external variables have no effect, for example:
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
var flag = 0
val test = rdd.map { line =>
  println("run")
  flag += 1
  println(flag)
  line._1
}
println(test.count)
println(flag)
/** Output:
 * run
 * 1
 * run
 * 2
 * run
 * 3
 * 3
 * 0
 */
Without mapPartitions, operations inside the map cannot affect variables defined outside the map: flag on the driver remains 0 even though each record increments it inside the closure.
7. Both DataFrame and DataSet can use pattern matching to obtain the value and type of each field
DataFrame:
import org.apache.spark.sql.Row  // needed to pattern match on Row
testDF.map {
  case Row(col1: String, col2: Int) =>
    println(col1); println(col2)
    col1
  case _ => ""
}
DataSet:
case class Coltest(col1: String, col2: Int) extends Serializable  // define field names and types
testDS.map {
  case Coltest(col1: String, col2: Int) =>
    println(col1); println(col2)
    col1
  case _ => ""
}
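As promised under point 2, here is a minimal sketch of lazy evaluation; the SparkSession name spark is an assumption, and the same behaviour applies to DataFrame and Dataset:

val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// Transformation: nothing is printed yet, the map is only recorded in the lineage
val mapped = rdd.map { line =>
  println("map is running")
  line._1
}

// Action: only now does Spark traverse the data and run the map
println(mapped.count())  // the three "map is running" lines appear only at this point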
Three, the difference
RDD:
- RDD is generally used together with spark mllib
- RDD does not support SparkSQL operations (see the sketch after this list)
- RDD is a distributed collection of Java/Scala objects
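A minimal sketch of the second point, reusing the spark and rdd names from the earlier examples: an RDD has no notion of columns, so SQL-style operations only become available after converting it to a DataFrame.

val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// rdd.select("col1")           // does not compile: an RDD has no columns
// rdd.groupBy("col1").count()  // RDD.groupBy takes a function, not a column name

import spark.implicits._
val df = rdd.toDF("col1", "col2")  // attach a schema first
df.select("col1").show()
df.groupBy("col1").count().show()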
DataFrame:
- Different from RDD and Dataset, the type of each row of a DataFrame is fixed as Row, so the value of each field can only be obtained by parsing, for example:
testDF.foreach { line =>
  val col1 = line.getAs[String]("col1")
  val col2 = line.getAs[Int]("col2")
}
- DataFrame and Dataset are generally used together with spark ml
- Both DataFrame and Dataset support SparkSQL operations such as select and groupBy, can be registered as a temporary table/view, and can be queried with SQL statements, for example:
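A minimal sketch of registering a temporary view and querying it with SQL, assuming a DataFrame testDF with columns col1 and col2 as in the surrounding examples:

// Register the DataFrame as a temporary view and query it with SQL
testDF.createOrReplaceTempView("test_table")
val result = spark.sql("SELECT col1, COUNT(*) AS cnt FROM test_table GROUP BY col1")
result.show()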
- DataFrame and Dataset support some particularly convenient saving methods, such as saving as CSV with the header included, so that the field names of each column are clear at a glance, for example:
// save
val saveoptions = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://172.xx.xx.xx:9000/test")
datawDF.write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).options(saveoptions).save()
// read
val options = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://172.xx.xx.xx:9000/test")
val datarDF = spark.read.options(options).format("com.databricks.spark.csv").load()
Dataset:
Here we mainly compare Dataset with DataFrame, because they have exactly the same member functions; the only difference is the data type of each row.
A DataFrame can also be written as Dataset[Row]: the type of each row is Row, so without parsing it is impossible to know which fields a row contains or what type each field has. Specific fields can only be extracted with the getAs method shown above or with the pattern matching mentioned in point 7 of the common points.
In a Dataset the type of each row is not fixed to Row; after defining a custom case class, you can freely access the fields of each row, for example:
case class Coltest(col1: String, col2: Int) extends Serializable  // define field names and types
/** contents of rdd:
 * ("a", 1)
 * ("b", 1)
 * ("a", 1)
 */
val test: Dataset[Coltest] = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS
test.foreach { line =>
  println(line.col1)
  println(line.col2)
}
As you can see, a Dataset is very convenient when you need to access a particular field of each row. However, if you want to write functions that adapt to many different inputs, a Dataset is awkward: the row type varies and could be any case class, so a single signature cannot cover them all. In that case using a DataFrame, i.e. Dataset[Row], solves the problem better.
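A minimal sketch of that idea; the helper function name and its column handling are illustrative assumptions. A function written against DataFrame works for any schema, whereas a typed version would have to be written per case class.

import org.apache.spark.sql.DataFrame

// Works for any input, regardless of which case class (if any) produced it
def describeColumns(df: DataFrame): Unit = {
  df.schema.fields.foreach { f =>
    println(s"${f.name}: ${f.dataType}")
  }
}

describeColumns(testDF)          // the DataFrame built earlier
describeColumns(testDS.toDF())   // a typed Dataset can always be viewed as Dataset[Row]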
Finally, a brief comparison of RDD and DataSet:
A DataSet is represented by a Catalyst logical execution plan, and its data is stored in an encoded binary form; operations such as sorting and shuffling can be performed on this form without deserialization.
Creating a DataSet requires an explicit Encoder, which serializes objects into binary form and maps the object's schema to Spark SQL types, whereas an RDD relies on runtime reflection.
Because of these two points, the performance of DataSet is much better than that of RDD.
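A minimal sketch of what an explicit Encoder looks like; Coltest and the values are illustrative, and the SparkSession name spark is carried over from the earlier sketches. Encoders.product derives an encoder for a Scala case class:

import org.apache.spark.sql.{Encoder, Encoders}

case class Coltest(col1: String, col2: Int) extends Serializable

// An explicit Encoder maps the case class fields to Spark SQL types
val enc: Encoder[Coltest] = Encoders.product[Coltest]
println(enc.schema)  // roughly: StructType(StructField(col1,StringType,...), StructField(col2,IntegerType,...))

// createDataset uses the Encoder to store records in Spark's internal binary format
val ds = spark.createDataset(Seq(Coltest("a", 1), Coltest("b", 2)))(enc)
ds.show()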
Four, mutual conversion operation
RDD, DataFrame, and Dataset have much in common, yet each has its own applicable scenarios, so conversion between the three is often needed.
DataFrame/Dataset to RDD
val rdd1 = testDF.rdd
val rdd2 = testDS.rdd
RDD to DataFrame
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")
RDD to Dataset
import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable  // define field names and types
val testDS = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS
Dataset to DataFrame
import spark.implicits._
val testDF = testDS.toDF
DataFrame to Dataset
import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable  // define field names and types
val testDS = testDF.as[Coltest]
Note: when using these conversion operations, be sure to add import spark.implicits._, otherwise toDF and toDS are not available.
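For completeness, here is a self-contained sketch that ties the conversions above together; the SparkSession setup and object name are illustrative assumptions, and it can be run, for example, as a small local application:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Coltest(col1: String, col2: Int) extends Serializable

object ConversionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("conversion-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Start from an RDD of tuples
    val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // RDD -> DataFrame -> Dataset
    val testDF: DataFrame = rdd.toDF("col1", "col2")
    val testDS: Dataset[Coltest] = testDF.as[Coltest]

    // Dataset -> DataFrame -> RDD[Row]
    val backToDF: DataFrame = testDS.toDF()
    val backToRDD = backToDF.rdd

    backToRDD.collect().foreach(println)
    spark.stop()
  }
}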