Scala-104: Built-in Date/Time Functions in Spark SQL
Sometimes we register a temporary view directly with df.createOrReplaceTempView("temp") and do the computation in SQL. Spark SQL differs from HiveQL in a few places, so here are some notes.
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
val builder = SparkSession
  .builder()
  .appName("learningScala")
  .config("spark.executor.heartbeatInterval","60s")
  .config("spark-work.timeout","120s")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer.max","512m")
  .config("spark.dynamicAllocation.enabled", false)
  .config("spark.sql.inMemoryColumnarStorage.compressed", true)
  .config("spark.sql.inMemoryColumnarStorage.batchSize", 10000)
  .config("spark.sql.broadcastTimeout", 600)
  .config("spark.sql.autoBroadcastJoinThreshold", -1)
  .config("spark.sql.crossJoin.enabled", true)
  .master("local[*]")

val spark = builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
builder: org.apache.spark.sql.SparkSession.Builder = org.apache.spark.sql.SparkSession$Builder@5418e964
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@15775384
var df1 = Seq(
  (1,"2019-04-01 11:45:50",11.15,"2019-04-02 11:45:49"),
  (2,"2019-05-02 11:56:50",10.37,"2019-05-02 11:56:51"),
  (3,"2019-07-21 12:45:50",12.11,"2019-08-21 12:45:50"),
  (4,"2019-08-01 12:40:50",14.50,"2020-08-03 12:40:50"),
  (5,"2019-01-06 10:00:50",16.39,"2019-01-05 10:00:50")
).toDF("id","startTimeStr","payamount","endTimeStr")

df1 = df1.withColumn("startTime",$"startTimeStr".cast("Timestamp"))
  .withColumn("endTime",$"endTimeStr".cast("Timestamp"))

df1.printSchema
df1.show()
root
 |-- id: integer (nullable = false)
 |-- startTimeStr: string (nullable = true)
 |-- payamount: double (nullable = false)
 |-- endTimeStr: string (nullable = true)
 |-- startTime: timestamp (nullable = true)
 |-- endTime: timestamp (nullable = true)

+---+-------------------+---------+-------------------+-------------------+-------------------+
| id|       startTimeStr|payamount|         endTimeStr|          startTime|            endTime|
+---+-------------------+---------+-------------------+-------------------+-------------------+
|  1|2019-04-01 11:45:50|    11.15|2019-04-02 11:45:49|2019-04-01 11:45:50|2019-04-02 11:45:49|
|  2|2019-05-02 11:56:50|    10.37|2019-05-02 11:56:51|2019-05-02 11:56:50|2019-05-02 11:56:51|
|  3|2019-07-21 12:45:50|    12.11|2019-08-21 12:45:50|2019-07-21 12:45:50|2019-08-21 12:45:50|
|  4|2019-08-01 12:40:50|     14.5|2020-08-03 12:40:50|2019-08-01 12:40:50|2020-08-03 12:40:50|
|  5|2019-01-06 10:00:50|    16.39|2019-01-05 10:00:50|2019-01-06 10:00:50|2019-01-05 10:00:50|
+---+-------------------+---------+-------------------+-------------------+-------------------+

df1: org.apache.spark.sql.DataFrame = [id: int, startTimeStr: string ... 4 more fields]
df1: org.apache.spark.sql.DataFrame = [id: int, startTimeStr: string ... 4 more fields]
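As a side note, instead of cast("Timestamp") the string columns could also be parsed with the built-in to_timestamp function, which makes the expected pattern explicit. A minimal sketch, not from the original notebook, assuming the same spark session and df1 as above (df1b is an illustrative name):

import org.apache.spark.sql.functions.to_timestamp

// Parse the string columns with an explicit pattern instead of cast("Timestamp")
val df1b = df1
  .withColumn("startTime", to_timestamp($"startTimeStr", "yyyy-MM-dd HH:mm:ss"))
  .withColumn("endTime", to_timestamp($"endTimeStr", "yyyy-MM-dd HH:mm:ss"))

df1b.printSchema  // startTime and endTime come back as timestamp, same as the cast version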
Timestamp to string
Convert a timestamp into a string in a given format.
date_format converts a timestamp into the corresponding string; the target format is given as a pattern string such as "yyyyMMdd".
df1.createOrReplaceTempView("temp")

var sql = """SELECT date_format(startTime,'yyyyMMdd') AS yyyyMMdd,
 date_format(startTime,'yyyy-MM-dd') AS yyyy_MM_dd,
 date_format(startTime,'yyyy') AS yyyy
FROM TEMP"""

spark.sql(sql).printSchema
spark.sql(sql).show()
root
 |-- yyyyMMdd: string (nullable = true)
 |-- yyyy_MM_dd: string (nullable = true)
 |-- yyyy: string (nullable = true)

+--------+----------+----+
|yyyyMMdd|yyyy_MM_dd|yyyy|
+--------+----------+----+
|20190401|2019-04-01|2019|
|20190502|2019-05-02|2019|
|20190721|2019-07-21|2019|
|20190801|2019-08-01|2019|
|20190106|2019-01-06|2019|
+--------+----------+----+

sql: String =
"SELECT date_format(startTime,'yyyyMMdd') AS yyyyMMdd,
 date_format(startTime,'yyyy-MM-dd') AS yyyy_MM_dd,
 date_format(startTime,'yyyy') AS yyyy
FROM TEMP"
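The same formatting is also available through the DataFrame API, without registering a temp view. A minimal sketch under the assumption that the df1 defined above is in scope (the aliases simply mirror the SQL query):

import org.apache.spark.sql.functions.date_format

// DataFrame-API equivalent of the date_format query above
df1.select(
  date_format($"startTime", "yyyyMMdd").as("yyyyMMdd"),
  date_format($"startTime", "yyyy-MM-dd").as("yyyy_MM_dd"),
  date_format($"startTime", "yyyy").as("yyyy")
).show()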
Timestamp to date
to_date converts a timestamp into a value of date type.
sql = """SELECT startTime,endTime, to_date(startTime) AS startDate, to_date(endTime) AS endDateFROM TEMP"""var df2 = spark.sql(sql)df2.printSchemadf2.show()
root
 |-- startTime: timestamp (nullable = true)
 |-- endTime: timestamp (nullable = true)
 |-- startDate: date (nullable = true)
 |-- endDate: date (nullable = true)

+-------------------+-------------------+----------+----------+
|          startTime|            endTime| startDate|   endDate|
+-------------------+-------------------+----------+----------+
|2019-04-01 11:45:50|2019-04-02 11:45:49|2019-04-01|2019-04-02|
|2019-05-02 11:56:50|2019-05-02 11:56:51|2019-05-02|2019-05-02|
|2019-07-21 12:45:50|2019-08-21 12:45:50|2019-07-21|2019-08-21|
|2019-08-01 12:40:50|2020-08-03 12:40:50|2019-08-01|2020-08-03|
|2019-01-06 10:00:50|2019-01-05 10:00:50|2019-01-06|2019-01-05|
+-------------------+-------------------+----------+----------+

sql: String =
SELECT startTime,endTime,
 to_date(startTime) AS startDate,
 to_date(endTime) AS endDate
FROM TEMP
df2: org.apache.spark.sql.DataFrame = [startTime: timestamp, endTime: timestamp ... 2 more fields]
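For comparison, a DataFrame-API sketch of the same conversion, assuming the df1 defined earlier (the startDate/endDate aliases mirror the SQL):

import org.apache.spark.sql.functions.to_date

// to_date is also exposed as a column function
df1.select(
  $"startTime",
  $"endTime",
  to_date($"startTime").as("startDate"),
  to_date($"endTime").as("endDate")
).show()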
Computing time differences
The day-difference function datediff works on both timestamp and date columns; its unit is a calendar day, not a 24-hour period, so "2019-04-01 11:45:50" to "2019-04-02 11:45:49" still counts as 1 day. The month-difference function months_between likewise accepts both types; note that its "month" unit is not fixed, since a month can be 30 or 31 days.
df2.createOrReplaceTempView("temp")

var sql = """SELECT startTime, endTime,
 datediff(endTime,startTime) AS dayInterval1,
 datediff(endDate,startDate) AS dayInterval2,
 months_between(endTime,startTime) AS monthInterval1,
 months_between(endDate,startDate) AS monthInterval2
FROM TEMP"""

// spark.sql(sql).printSchema
spark.sql(sql).show()
+-------------------+-------------------+------------+------------+--------------+--------------+
|          startTime|            endTime|dayInterval1|dayInterval2|monthInterval1|monthInterval2|
+-------------------+-------------------+------------+------------+--------------+--------------+
|2019-04-01 11:45:50|2019-04-02 11:45:49|           1|           1|    0.03225769|    0.03225806|
|2019-05-02 11:56:50|2019-05-02 11:56:51|           0|           0|           0.0|           0.0|
|2019-07-21 12:45:50|2019-08-21 12:45:50|          31|          31|           1.0|           1.0|
|2019-08-01 12:40:50|2020-08-03 12:40:50|         368|         368|   12.06451613|   12.06451613|
|2019-01-06 10:00:50|2019-01-05 10:00:50|          -1|          -1|   -0.03225806|   -0.03225806|
+-------------------+-------------------+------------+------------+--------------+--------------+

sql: String =
"SELECT startTime, endTime,
 datediff(endTime,startTime) AS dayInterval1,
 datediff(endDate,startDate) AS dayInterval2,
 months_between(endTime,startTime) AS monthInterval1,
 months_between(endDate,startDate) AS monthInterval2
FROM TEMP"
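Both functions are also exposed in org.apache.spark.sql.functions, so the same result can be computed without SQL. A minimal sketch assuming the df2 built above (the aliases mirror the query):

import org.apache.spark.sql.functions.{datediff, months_between}

// DataFrame-API equivalent of the datediff / months_between query above
df2.select(
  $"startTime",
  $"endTime",
  datediff($"endTime", $"startTime").as("dayInterval1"),
  datediff($"endDate", $"startDate").as("dayInterval2"),
  months_between($"endTime", $"startTime").as("monthInterval1"),
  months_between($"endDate", $"startDate").as("monthInterval2")
).show()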
2020-03-24, at Jiulonghu, Jiangning District, Nanjing