Hive on Spark: HQL insert fails with "Failed to create Spark client for Spark session" (error code 30041)

Submitted by a reader, 2022-11-12

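Contents

I. The problem
II. Troubleshooting
    0. Confirm the Hive and Spark versions
    1. Confirm the SPARK_HOME environment variable
    2. The Spark configuration file created for Hive
    3. Confirm the HDFS directory for history logs exists
    4. Confirm the Spark "pure" (without-hadoop) jars have been uploaded
    5. Confirm the hive-site.xml configuration
III. The fix
IV. Afterword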

I. The problem

In an offline data warehouse running in Hive on Spark mode, inserting data with SQL from the Hive client fails with:

Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session 50cec71c-2636-4d99-8de2-a580ae3f1c58)'
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 50cec71c-2636-4d99-8de2-a580ae3f1c58

The full error output:

[hadoop@hadoop102 ~]$ hive
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/datafs/module/jdk1.8.0_212/bin:/datafs/module/hadoop-3.1.3/bin:/datafs/module/hadoop-3.1.3/sbin:/datafs/module/zookeeper-3.5.7/bin:/datafs/module/kafka/bin:/datafs/module/flume/bin:/datafs/module/mysql-5.7.35/bin:/datafs/module/hive/bin:/datafs/module/spark/bin:/home/hadoop/.local/bin:/home/hadoop/bin)
Hive Session ID = 7db87c21-d9fb-4e76-a868-770691199377
Logging initialized using configuration in jar:file:/datafs/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = 24cd3001-0726-482f-9294-c901f49ace29
hive (default)> show databases;
OK
database_name
default
Time taken: 1.582 seconds, Fetched: 1 row(s)
hive (default)> show tables;
OK
tab_name
student
Time taken: 0.118 seconds, Fetched: 1 row(s)
hive (default)> select * from student;
OK
student.id student.name
Time taken: 4.1 seconds
hive (default)> insert into table student values(1,'abc');
Query ID = hadoop_20220728195619_ded278b4-0ffa-41f2-9f2f-49313ea3d752
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session 50cec71c-2636-4d99-8de2-a580ae3f1c58)'
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 50cec71c-2636-4d99-8de2-a580ae3f1c58
hive (default)> [hadoop@hadoop102 ~]$
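The client output only shows the generic 30041 code; the underlying cause usually has to be read from the Hive log. Below is a minimal sketch for pulling the relevant lines out of it. It assumes the log is at /tmp/<user>/hive.log, which is Hive's default location unless hive-log4j2.properties overrides it. The function name `spark_client_errors` is made up for this example.

```shell
# Hypothetical helper: print "Failed to create Spark client" entries from a
# Hive log, with a few lines of trailing context.
# Usage: spark_client_errors [logfile]
spark_client_errors() {
    # Assumption: default Hive log location is /tmp/<user>/hive.log
    local logfile="${1:-/tmp/$(id -un)/hive.log}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # -n prints line numbers, -A 5 keeps five lines after each match
    # (typically the start of the stack trace)
    grep -n -A 5 'Failed to create Spark client' "$logfile"
}
```

Calling `spark_client_errors` with no argument reads the current user's default log; pass a path if the log lives elsewhere. A timeout in the lines after the match points toward the connection settings discussed later in this post.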

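II. Troubleshooting

0. Confirm the Hive and Spark versions

Hive 3.1.2: apache-hive-3.1.2-bin.tar.gz (recompiled)

Spark 3.0.0: spark-3.0.0-bin-hadoop3.2.tgz and spark-3.0.0-bin-without-hadoop.tgz

Compatibility note: Hive 3.1.2 and Spark 3.0.0 as downloaded from the official sites are not compatible out of the box. Hive 3.1.2 supports Spark 2.4.5, so Hive 3.1.2 has to be recompiled.

Build steps: take the Hive 3.1.2 source from the official site and change the Spark version referenced in the pom file to 3.0.0. If it compiles, package it and take the resulting jars. If it fails, fix the methods named in the error messages until it compiles, then package it and take the jars.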

1. Confirm the SPARK_HOME environment variable

[hadoop@hadoop102 software]$ sudo vim /etc/profile.d/my_env.sh
# Add the following
# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin

Source the file so the change takes effect:

[hadoop@hadoop102 software]$ source /etc/profile.d/my_env.sh
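A quick sanity check after sourcing the file can catch a SPARK_HOME that points at the wrong directory. The sketch below is only an illustration: `check_spark_home` is a hypothetical name, and it assumes that finding an executable `bin/spark-submit` under the directory is a good enough sign that the path is right.

```shell
# Hypothetical helper: verify that a Spark home directory looks usable.
# Usage: check_spark_home [dir]   (defaults to $SPARK_HOME)
check_spark_home() {
    local dir="${1:-${SPARK_HOME:-}}"
    if [ -z "$dir" ]; then
        echo "SPARK_HOME is not set" >&2
        return 1
    fi
    # Assumption: a valid Spark install has an executable bin/spark-submit
    if [ ! -x "$dir/bin/spark-submit" ]; then
        echo "no executable spark-submit under $dir/bin" >&2
        return 1
    fi
    echo "SPARK_HOME OK: $dir"
}
```

Run `check_spark_home` in the same shell that starts the Hive client. Note that the PATH printed in the session output earlier contains /datafs/module/spark/bin, so on that cluster the variable should point at the directory where Spark is really installed.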

2. The Spark configuration file created for Hive

Create a Spark configuration file in Hive's conf directory:

[atguigu@hadoop102 software]$ vim /opt/module/hive/conf/spark-defaults.conf
# Add the following (jobs are run with these settings)
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop102:8020/spark-history
spark.executor.memory 1g
spark.driver.memory 1g

3. Confirm the HDFS directory for history logs exists

Check whether the history log directory has been created:

[hadoop@hadoop102 conf]$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x - hadoop supergroup 0 2022-07-28 20:31 /spark-history
drwxr-xr-x - hadoop supergroup 0 2022-03-15 16:42 /test
drwxrwx--- - hadoop supergroup 0 2022-03-16 09:14 /tmp
drwxrwxrwx - hadoop supergroup 0 2022-07-28 18:38 /user

If it does not exist, create it on HDFS:

[hadoop@hadoop102 software]$ hadoop fs -mkdir /spark-history
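The check and the create can be combined into one idempotent step. This is a sketch only: `ensure_hdfs_dir` is a hypothetical name, and it relies on the standard `hdfs dfs -test -d` and `hdfs dfs -mkdir -p` subcommands.

```shell
# Hypothetical helper: create an HDFS directory only if it is missing.
# Usage: ensure_hdfs_dir /spark-history
ensure_hdfs_dir() {
    local dir="$1"
    # `hdfs dfs -test -d` exits 0 when the path exists and is a directory
    if hdfs dfs -test -d "$dir"; then
        echo "exists: $dir"
    else
        hdfs dfs -mkdir -p "$dir" && echo "created: $dir"
    fi
}
```

For the path used in this post the call would be `ensure_hdfs_dir /spark-history`. It is safe to run repeatedly.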

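4. Confirm the Spark "pure" (without-hadoop) jars have been uploaded

Note 1: The regular (non-pure) Spark 3.0.0 build supports Hive 2.3.7 by default, so using it directly causes compatibility problems with the installed Hive 3.1.2. Use the jars from the pure Spark build instead: it contains no Hadoop or Hive dependencies, which avoids the conflict.

Note 2: Hive jobs are ultimately executed by Spark, and Spark's resources are scheduled by YARN, so a job may be assigned to any node in the cluster. The Spark dependencies therefore have to be uploaded to a path on HDFS where every node can reach them.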

[hadoop@hadoop102 software]$ tar -zxvf /opt/software/spark-3.0.0-bin-without-hadoop.tgz

Upload the pure Spark jars to HDFS:

[hadoop@hadoop102 software]$ hadoop fs -mkdir /spark-jars

[hadoop@hadoop102 software]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
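An incomplete upload is easy to miss, so it can be worth comparing the number of jars on disk with the number on HDFS. The sketch below is a rough check under stated assumptions: `compare_jar_counts` is a hypothetical name, and it counts lines of `hdfs dfs -ls` output ending in `.jar` rather than comparing file names or sizes.

```shell
# Hypothetical helper: compare the number of local jars with the number on HDFS.
# Usage: compare_jar_counts spark-3.0.0-bin-without-hadoop/jars /spark-jars
compare_jar_counts() {
    local local_dir="$1" hdfs_dir="$2" n_local n_hdfs
    # Count *.jar files directly inside the local directory
    n_local=$(( $(find "$local_dir" -maxdepth 1 -name '*.jar' | wc -l) ))
    # Count listing lines that end in .jar; grep -c exits 1 on zero matches,
    # so `|| true` keeps that from being treated as a failure
    n_hdfs=$(( $(hdfs dfs -ls "$hdfs_dir" | grep -c '\.jar$' || true) ))
    echo "local=$n_local hdfs=$n_hdfs"
    # Exit status 0 only when both counts match
    [ "$n_local" -eq "$n_hdfs" ]
}
```

A mismatch suggests re-running the `hadoop fs -put` step. Equal counts do not prove the files are intact, only that none are missing.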

5. Confirm the hive-site.xml configuration

[hadoop@hadoop102 ~]$ vim /opt/module/hive/conf/hive-site.xml

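Add the following:

<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://hadoop102:8020/spark-jars/*</value>
</property>
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>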

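III. The fix

Append the following to hive/conf/hive-site.xml. It lengthens the time allowed for Hive and Spark to connect, which effectively avoids the timeout error:

<property>
    <name>hive.spark.client.connect.timeout</name>
    <value>100000ms</value>
</property>

After that, reopen the Hive client; the insert now runs without errors: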
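Before restarting the client, it can help to confirm the file really contains the value you expect (Hive's documented default for this property is 1000ms). The following line-based sketch reads a property from a Hadoop-style XML config. `hive_site_value` is a hypothetical name, and the parsing assumes `<name>` and `<value>` each sit on their own line, the usual hive-site.xml layout; it is not a real XML parser.

```shell
# Hypothetical helper: print the <value> of a named property in a
# Hadoop-style XML config file.
# Usage: hive_site_value hive.spark.client.connect.timeout /opt/module/hive/conf/hive-site.xml
hive_site_value() {
    local name="$1" file="$2"
    awk -v n="$name" '
        # Remember that the wanted <name> line has been seen
        index($0, "<name>" n "</name>") { found = 1; next }
        # The next <value> line after it holds the setting
        found && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
        }
    ' "$file"
}
```

An empty result means the property is absent from the file (or laid out differently than assumed), in which case Hive falls back to its default.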

[hadoop@hadoop102 conf]$ hive
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/datafs/module/jdk1.8.0_212/bin:/datafs/module/hadoop-3.1.3/bin:/datafs/module/hadoop-3.1.3/sbin:/datafs/module/zookeeper-3.5.7/bin:/datafs/module/kafka/bin:/datafs/module/flume/bin:/datafs/module/mysql-5.7.35/bin:/datafs/module/hive/bin:/datafs/module/spark/bin:/home/hadoop/.local/bin:/home/hadoop/bin)
Hive Session ID = b7564f00-0c04-45fd-9984-4ecd6e6149c2
Logging initialized using configuration in jar:file:/datafs/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = e4af620a-8b6a-422e-b921-5d6c58b81293
hive (default)>

The first insert is slow because the Spark session has to be initialized:

hive (default)> insert into table student values(1,'abc');
Query ID = hadoop_20220728201636_11b37058-89dc-4050-a4bf-1dcf404bd579
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Running with YARN Application = application_1659005322171_0009
Kill Command = /datafs/module/hadoop-3.1.3/bin/yarn application -kill application_1659005322171_0009
Hive on Spark Session Web UI URL:
Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      1          1        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 40.06 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 40.06 second(s)
WARNING: Spark Job[0] Spent 16% (3986 ms / 25006 ms) of task time in GC
Loading data to table default.student
OK
col1 col2
Time taken: 127.46 seconds
hive (default)>

Subsequent inserts are much faster:

hive (default)> insert into table student values(2,'ddd');
Query ID = hadoop_20220728202000_1093388b-3ec6-45e5-a9f1-1b07c64f2583
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Running with YARN Application = application_1659005322171_0009
Kill Command = /datafs/module/hadoop-3.1.3/bin/yarn application -kill application_1659005322171_0009
Hive on Spark Session Web UI URL:
Hive on Spark job[1] stages: [2, 3]
Spark job[1] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-2 ........         0      FINISHED      1          1        0        0       0
Stage-3 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 2.12 s
--------------------------------------------------------------------------------------
Spark job[1] finished successfully in 3.20 second(s)
Loading data to table default.student
OK
col1 col2
Time taken: 6.0 seconds
hive (default)>

Query the data:

hive (default)> select * from student;
OK
student.id student.name
1 abc
2 ddd
Time taken: 0.445 seconds, Fetched: 2 row(s)
hive (default)> [hadoop@hadoop102 conf]$

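IV. Afterword

When you hit a problem, don't give up. I searched online for a lot of solutions, and many of them were unreliable. The one that worked was written by an expert in a comment section.

When I got to the third approach suggested there, the problem was solved instantly.

The moment the first row was inserted successfully, I felt a sense of accomplishment I had not had in a long time. It felt great.

I am sharing this post for two reasons: to record how the problem was solved, and to help beginners.

See you next time, bye!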

