Computing the difference, intersection, and union of two DataFrames in Spark

Reader submission, 2022-08-25

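Spark can compute the difference, intersection, and union of two DataFrames. These operations compare entire rows, so it is best to select only the column you want to compare. First, create two DataFrames: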

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("TTyb").setMaster("local")
  val sc = new SparkContext(conf)
  val spark = new SQLContext(sc)

  val sentenceDataFrame = spark.createDataFrame(Seq(
    (1, "asf"),
    (2, "2143"),
    (3, "rfds")
  )).toDF("label", "sentence")
  sentenceDataFrame.show()

  val sentenceDataFrame1 = spark.createDataFrame(Seq(
    (1, "asf"),
    (2, "2143"),
    (4, "f8934y")
  )).toDF("label", "sentence")
  sentenceDataFrame1.show()
}
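
The setup above goes through SQLContext. On Spark 2.0 and later, SparkSession is the usual entry point, so the same two DataFrames could be built roughly as follows. This is a minimal sketch, not the author's code: the object name `SetOpsSetup` and the helpers `session` and `frames` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object; only the data and column names come from the article.
object SetOpsSetup {
  // Builds (or reuses) a local SparkSession with the same app name as the original example.
  def session(): SparkSession =
    SparkSession.builder().appName("TTyb").master("local").getOrCreate()

  // Returns the article's two DataFrames: (sentenceDataFrame, sentenceDataFrame1).
  def frames(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._ // enables .toDF on local Seqs of tuples
    val sentenceDataFrame =
      Seq((1, "asf"), (2, "2143"), (3, "rfds")).toDF("label", "sentence")
    val sentenceDataFrame1 =
      Seq((1, "asf"), (2, "2143"), (4, "f8934y")).toDF("label", "sentence")
    (sentenceDataFrame, sentenceDataFrame1)
  }

  def main(args: Array[String]): Unit = {
    val spark = session()
    val (sentenceDataFrame, sentenceDataFrame1) = frames(spark)
    sentenceDataFrame.show()
    sentenceDataFrame1.show()
    spark.stop()
  }
}
```

The `except`, `intersect`, `union`, and `distinct` calls work the same way on DataFrames created like this.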

Difference: except

val newDF = sentenceDataFrame1.select("sentence").except(sentenceDataFrame.select("sentence"))
newDF.show()

+--------+
|sentence|
+--------+
|  f8934y|
+--------+

Intersection: intersect

val newDF = sentenceDataFrame1.select("sentence").intersect(sentenceDataFrame.select("sentence"))
newDF.show()

+--------+
|sentence|
+--------+
|     asf|
|    2143|
+--------+
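
Selecting a single column before `except` or `intersect` means the result contains only that column. If the other columns (here `label`) are also needed, one possible alternative is a `left_anti` or `left_semi` join on the compared column. The sketch below is an assumption-laden illustration, not part of the original article: `KeepRowsSketch`, `onlyInLeft`, and `inBoth` are hypothetical names. Unlike `except` and `intersect`, these joins do not deduplicate the result, and null keys never match.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object for illustration only.
object KeepRowsSketch {
  // Rows of `left` whose `column` value does not occur in `right` (difference by one column).
  def onlyInLeft(left: DataFrame, right: DataFrame, column: String): DataFrame =
    left.join(right.select(column), Seq(column), "left_anti")
      .select(left.columns.map(col): _*) // restore the column order of `left`

  // Rows of `left` whose `column` value also occurs in `right` (intersection by one column).
  def inBoth(left: DataFrame, right: DataFrame, column: String): DataFrame =
    left.join(right.select(column), Seq(column), "left_semi")
      .select(left.columns.map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KeepRowsSketch").master("local").getOrCreate()
    import spark.implicits._
    val sentenceDataFrame =
      Seq((1, "asf"), (2, "2143"), (3, "rfds")).toDF("label", "sentence")
    val sentenceDataFrame1 =
      Seq((1, "asf"), (2, "2143"), (4, "f8934y")).toDF("label", "sentence")

    // Keeps the full row with label 4, whose sentence is missing from sentenceDataFrame.
    onlyInLeft(sentenceDataFrame1, sentenceDataFrame, "sentence").show()
    // Keeps the full rows with labels 1 and 2, whose sentences occur in both DataFrames.
    inBoth(sentenceDataFrame1, sentenceDataFrame, "sentence").show()
    spark.stop()
  }
}
```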

Union: union

val newDF = sentenceDataFrame1.select("sentence").union(sentenceDataFrame.select("sentence"))
newDF.show()

+--------+
|sentence|
+--------+
|     asf|
|    2143|
|  f8934y|
|     asf|
|    2143|
|    rfds|
+--------+

union keeps duplicate rows (it behaves like SQL UNION ALL), so it is usually best to deduplicate the result with distinct:

val newDF = sentenceDataFrame1.select("sentence").union(sentenceDataFrame.select("sentence")).distinct()
newDF.show()

+--------+
|sentence|
+--------+
|    rfds|
|     asf|
|    2143|
|  f8934y|
+--------+
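
For readers who prefer SQL, the same operations can also be written with Spark SQL's set operators. The following is a rough sketch assuming Spark 2.0 or later; the object name `SqlSetOpsSketch`, the helper `run`, and the view names `left_df` and `right_df` are invented for illustration. In Spark SQL, `UNION` deduplicates like `union(...).distinct()`, while `UNION ALL` corresponds to a plain `union`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object for illustration only.
object SqlSetOpsSketch {
  // Registers both DataFrames as temporary views and applies the given SQL set
  // operator ("EXCEPT", "INTERSECT", "UNION" or "UNION ALL") to the sentence column.
  def run(spark: SparkSession, left: DataFrame, right: DataFrame, op: String): DataFrame = {
    left.createOrReplaceTempView("left_df")
    right.createOrReplaceTempView("right_df")
    spark.sql(s"SELECT sentence FROM left_df $op SELECT sentence FROM right_df")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlSetOpsSketch").master("local").getOrCreate()
    import spark.implicits._
    val sentenceDataFrame =
      Seq((1, "asf"), (2, "2143"), (3, "rfds")).toDF("label", "sentence")
    val sentenceDataFrame1 =
      Seq((1, "asf"), (2, "2143"), (4, "f8934y")).toDF("label", "sentence")

    // Same operand order as the article: sentenceDataFrame1 on the left.
    run(spark, sentenceDataFrame1, sentenceDataFrame, "EXCEPT").show()    // difference
    run(spark, sentenceDataFrame1, sentenceDataFrame, "INTERSECT").show() // intersection
    run(spark, sentenceDataFrame1, sentenceDataFrame, "UNION ALL").show() // union with duplicates
    run(spark, sentenceDataFrame1, sentenceDataFrame, "UNION").show()     // deduplicated union
    spark.stop()
  }
}
```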
