2022-08-25
Computing the difference, intersection, and union of two DataFrames in Spark
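When computing the difference, intersection, or union of two Spark DataFrames, it is best to select a single column to compare, because these operations compare entire rows. First, create two DataFrames: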
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// main must live inside an object to compile; the object name is arbitrary.
object SetOperations {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TTyb").setMaster("local")
    val sc = new SparkContext(conf)
    // Named `spark`, but this is an SQLContext (the pre-2.0 entry point).
    val spark = new SQLContext(sc)

    val sentenceDataFrame = spark.createDataFrame(Seq(
      (1, "asf"),
      (2, "2143"),
      (3, "rfds")
    )).toDF("label", "sentence")
    sentenceDataFrame.show()

    val sentenceDataFrame1 = spark.createDataFrame(Seq(
      (1, "asf"),
      (2, "2143"),
      (4, "f8934y")
    )).toDF("label", "sentence")
    sentenceDataFrame1.show()

    // The snippets in the following sections go here, inside main.
  }
}
Difference: except
val newDF = sentenceDataFrame1.select("sentence").except(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
|  f8934y|
+--------+
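Selecting one column before `except` drops the other columns from the result. If the full rows are needed, a `left_anti` join on the compared column is one alternative. The sketch below assumes Spark 2.0 or later (it uses `SparkSession` instead of the `SQLContext` above); the names `left`, `right`, and `onlyInLeft` are made up for illustration. Unlike `except`, an anti join does not deduplicate the surviving rows.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch; the app name is arbitrary.
val spark = SparkSession.builder().appName("AntiJoinSketch").master("local[1]").getOrCreate()
import spark.implicits._

// Same data as sentenceDataFrame1 (left) and sentenceDataFrame (right).
val left = Seq((1, "asf"), (2, "2143"), (4, "f8934y")).toDF("label", "sentence")
val right = Seq((1, "asf"), (2, "2143"), (3, "rfds")).toDF("label", "sentence")

// Keep rows of `left` whose `sentence` has no match in `right`.
// All columns of `left` survive; nothing from `right` is added.
val onlyInLeft = left.join(right.select("sentence"), Seq("sentence"), "left_anti")
onlyInLeft.show()
```

Here only the row with label 4 remains, with both its `label` and `sentence` values intact.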
Intersection: intersect
val newDF = sentenceDataFrame1.select("sentence").intersect(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
|     asf|
|    2143|
+--------+
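As with the difference, `intersect` on a single column returns only that column. A `left_semi` join is one way to keep the full rows of the left DataFrame whose `sentence` also appears on the right. This is a minimal sketch assuming Spark 2.0 or later; `left`, `right`, and `inBoth` are illustrative names. A semi join does not deduplicate, whereas `intersect` does.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch; the app name is arbitrary.
val spark = SparkSession.builder().appName("SemiJoinSketch").master("local[1]").getOrCreate()
import spark.implicits._

// Same data as sentenceDataFrame1 (left) and sentenceDataFrame (right).
val left = Seq((1, "asf"), (2, "2143"), (4, "f8934y")).toDF("label", "sentence")
val right = Seq((1, "asf"), (2, "2143"), (3, "rfds")).toDF("label", "sentence")

// Keep rows of `left` whose `sentence` also occurs in `right`.
// Only columns of `left` appear in the result.
val inBoth = left.join(right.select("sentence"), Seq("sentence"), "left_semi")
inBoth.show()
```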
Union: union
val newDF = sentenceDataFrame1.select("sentence").union(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
|     asf|
|    2143|
|  f8934y|
|     asf|
|    2143|
|    rfds|
+--------+
union keeps duplicate rows (it behaves like SQL UNION ALL), so it is best to deduplicate the result with distinct:
val newDF = sentenceDataFrame1.select("sentence").union(sentenceDataFrame.select("sentence")).distinct()
newDF.show()
+--------+
|sentence|
+--------+
|    rfds|
|     asf|
|    2143|
|  f8934y|
+--------+
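The duplicate handling is asymmetric: `union` keeps duplicates, while `except` and `intersect` drop them. If duplicates matter, Spark 2.4 and later also provide `exceptAll` and `intersectAll`, which preserve them. The sketch below uses a small made-up dataset (`a` and `b` are not from the example above) to show the difference in row counts.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch; the app name is arbitrary.
val spark = SparkSession.builder().appName("DuplicatesSketch").master("local[1]").getOrCreate()
import spark.implicits._

// Illustrative data: "asf" appears twice in `a` and once in `b`.
val a = Seq("asf", "asf", "2143").toDF("sentence")
val b = Seq("asf").toDF("sentence")

// except has set semantics: every "asf" is removed and the result is distinct.
val setDiff = a.except(b)
// exceptAll has bag semantics: one "asf" in `b` cancels one "asf" in `a`.
val bagDiff = a.exceptAll(b)

// intersect deduplicates; intersectAll keeps matching duplicates.
val setInter = a.intersect(a)
val bagInter = a.intersectAll(a)

println(s"except: ${setDiff.count()}, exceptAll: ${bagDiff.count()}")
println(s"intersect: ${setInter.count()}, intersectAll: ${bagInter.count()}")
```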