CueSheet: a framework for writing Spark 2.x applications in a pretty way

User contribution, 2022-10-11

CueSheet

CueSheet is a framework for writing Apache Spark 2.x applications more conveniently. It is designed to neatly separate the concerns of the business logic and the deployment environment, and to minimize the use of shell scripts, which are inconvenient to write and do not support validation. To get started quickly, check out cuesheet-starter-kit, which provides the skeleton for building CueSheet applications. CueSheet was featured at Spark Summit East 2017.

An example of a CueSheet application is shown below. Any Scala object extending CueSheet becomes a CueSheet application; the object body can then use variables such as sc, sqlContext, and spark to write the business logic, as if it were inside spark-shell:

import com.kakao.cuesheet.CueSheet

object Example extends CueSheet {{
  val rdd = sc.parallelize(1 to 100)
  println(s"sum = ${rdd.sum()}")
  println(s"sum2 = ${rdd.map(_ + 1).sum()}")
}}

CueSheet will take care of creating SparkContext or SparkSession according to the configuration given in a separate file, so that your application code can contain just the business logic. Furthermore, CueSheet will launch the application locally or to a YARN cluster by simply running your object as a Java application, eliminating the need to use spark-submit and accompanying shell scripts.

CueSheet also supports Spark Streaming applications, via ssc. When ssc is used in the object body, the application automatically becomes a Spark Streaming application, and ssc provides access to the StreamingContext.

Importing CueSheet

libraryDependencies += "com.kakao.cuesheet" %% "cuesheet" % "0.10.0"

CueSheet can be used in Scala projects by configuring SBT as above. Note that this dependency is not specified as "provided", which makes it possible to launch the application right in the IDE, and even debug using breakpoints in driver code when launched in client mode.

Configuration

The configuration for your CueSheet application, including Spark settings and the arguments normally passed to spark-submit, is specified in the HOCON format. By default it is read from application.conf in your classpath root, but an alternate configuration file can be specified using -Dconfig.resource or -Dconfig.file. Below is an example configuration file.

spark {
  master = "yarn:classpath:com.kakao.cuesheet.launcher.test"
  deploy.mode = cluster
  hadoop.user.name = "cloudera"
  executor.instances = 2
  executor.memory = 1g
  driver.memory = 1g
  streaming.blockInterval = 10000
  eventLog.enabled = false
  eventLog.dir = "hdfs:///user/spark/applicationHistory"
  yarn.historyServer.address = "http://history.server:18088"
  driver.extraJavaOptions = "-XX:MaxPermSize=512m"
}

Unlike the standard Spark configuration, spark.master for YARN should include an indicator for finding the YARN/Hive/Hadoop configuration. The easiest way is to put the XML files on your classpath, usually under src/main/resources, and specify the package containing them, as in the yarn:classpath: value above. Alternatively, spark.master can contain a URL from which to download the configuration as a ZIP file, e.g. yarn:http://cloudera.manager/hive/configuration.zip, copied from Cloudera Manager's 'Download Client Configuration' link. The usual local or local[8] can also be used as spark.master.
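To make the accepted forms of spark.master easier to compare, here is a minimal Scala sketch that classifies a value into the cases described above. It is not CueSheet's own parsing code: MasterSpec, Local, YarnClasspath, YarnUrl, and parse are hypothetical names introduced only for illustration, and accepting https URLs is an assumption.

```scala
// Hypothetical model of the spark.master forms described in the text.
// None of these names come from CueSheet itself.
sealed trait MasterSpec
case class Local(threads: Option[String]) extends MasterSpec
case class YarnClasspath(pkg: String) extends MasterSpec
case class YarnUrl(url: String) extends MasterSpec

object MasterSpec {
  // Matches the whole string, e.g. "local[8]" captures "8".
  private val LocalN = """local\[(.+)\]""".r

  def parse(master: String): Either[String, MasterSpec] = master match {
    case "local" =>
      Right(Local(None))
    case LocalN(n) =>
      Right(Local(Some(n)))
    case s if s.startsWith("yarn:classpath:") =>
      // The remainder names the package holding the Hadoop XML files.
      Right(YarnClasspath(s.stripPrefix("yarn:classpath:")))
    case s if s.startsWith("yarn:http://") || s.startsWith("yarn:https://") =>
      // The remainder is a URL to a ZIP of client configuration (https is an assumption).
      Right(YarnUrl(s.stripPrefix("yarn:")))
    case other =>
      Left(s"unrecognized spark.master: $other")
  }
}
```

For example, MasterSpec.parse("yarn:classpath:com.kakao.cuesheet.launcher.test") yields a YarnClasspath carrying the package name, while an unknown scheme falls through to the Left branch.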

spark.deploy.mode can be either client or cluster, and spark.hadoop.user.name should be the username to be used as the Hadoop user. CueSheet assumes that this user has write permission to the home directory.

Using HDFS

While submitting an application to YARN, CueSheet will copy Spark and CueSheet's dependency jars to HDFS. As a result, the next time you submit your application, CueSheet will analyze your classpath to find and assemble only the classes that are not part of the already installed jars.
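The idea of shipping only what is not already installed can be sketched in a few lines of Scala. This is an illustration under stated assumptions, not CueSheet's implementation: the source does not say how installed jars are identified, so matching by jar file name is an assumption, and ClasspathDiff and partition are hypothetical names.

```scala
import java.io.File

object ClasspathDiff {
  /** Splits classpath entries into (alreadyInstalled, toAssemble).
    * An entry counts as installed when its file name appears in
    * installedJarNames; matching by name only is an assumption. */
  def partition(classpath: Seq[String],
                installedJarNames: Set[String]): (Seq[String], Seq[String]) =
    classpath.partition(entry => installedJarNames.contains(new File(entry).getName))
}
```

Given a classpath containing a Spark jar that is already on HDFS and a local classes directory, only the classes directory ends up in the second sequence, so only it would need to be assembled and uploaded.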

One-Liner for Easy Deployment

When given a tag name as the system property cuesheet.install, CueSheet will print a rather long shell command that can launch your application from anywhere the hdfs command is available. Below is an example of the one-liner shell command that CueSheet produces when given -Dcuesheet.install=v0.0.1 as a JVM argument.

rm -rf SimpleExample_2.10-v0.0.1 && mkdir SimpleExample_2.10-v0.0.1 && cd SimpleExample_2.10-v0.0.1 &&
echo '<configuration><property><name>dfs.ha.automatic-failover.enabled</name><value>false</value></property><property><name>fs.defaultFS</name><value>hdfs://quickstart.cloudera:8020</value></property></configuration>' > core-site.xml &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/applications/com.kakao.cuesheet.SimpleExample/v0.0.1/SimpleExample_2.10.jar \!SimpleExample_2.10.jar &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/lib/0.10.0-SNAPSHOT-scala-2.10-spark-2.1.0/*.jar &&
java -classpath "*" com.kakao.cuesheet.SimpleExample "hello" "world" && cd .. && rm -rf SimpleExample_2.10-v0.0.1

This command downloads the CueSheet and Spark jars as well as your application assembly from HDFS, and launches the application in the same environment as when it was launched from the IDE. It is therefore not necessary to have HADOOP_CONF_DIR or SPARK_HOME properly installed and set on every node, which makes the application much easier to run from distributed schedulers like Marathon, Chronos, or Aurora. These schedulers typically accept a single-line shell command as their job specification, so you can simply paste what CueSheet gives you into the scheduler's Web UI.
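The structure of such a command (prepare a scratch directory, write a minimal core-site.xml, fetch the jars, run, clean up) can be mirrored by a small Scala helper. This is only a simplified sketch and not the generator CueSheet uses: OneLiner, build, and every parameter are hypothetical, and the renamed download target for the application jar seen in the example above is left out.

```scala
object OneLiner {
  /** Joins the steps with " && " so that any failing step aborts the rest.
    * All parameters are hypothetical inputs for illustration; arguments
    * are wrapped in double quotes without escaping embedded quotes. */
  def build(workDir: String,
            coreSiteXml: String,
            appJarUri: String,
            libDirUri: String,
            mainClass: String,
            args: Seq[String]): String = {
    val quotedArgs = args.map(a => "\"" + a + "\"").mkString(" ")
    Seq(
      s"rm -rf $workDir",
      s"mkdir $workDir",
      s"cd $workDir",
      s"echo '$coreSiteXml' > core-site.xml",       // minimal Hadoop client config
      s"hdfs --config . dfs -get $appJarUri",        // application assembly
      s"hdfs --config . dfs -get $libDirUri/*.jar",  // Spark and CueSheet jars
      s"""java -classpath "*" $mainClass $quotedArgs""",
      "cd ..",
      s"rm -rf $workDir"
    ).mkString(" && ")
  }
}
```

Chaining with && rather than ; is the design point here: a scheduler that runs the line sees a non-zero exit status as soon as one step fails, instead of launching the application with missing jars.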

Additional Features

Having started as a library of reusable Spark functions, CueSheet contains a number of additional features that are not organized in a particularly coherent manner. Many parts of CueSheet, including these features, are powered by the Mango library, another open-source project by Kakao.

- Nearest-neighbor collaborative filtering
- Connectors to HBase, Couchbase, and ElasticSearch to save RDD data with adjustable client-side throttling
- Reading an HBase table as an RDD
- Tools for parsing RDDs and DStreams encoded with Apache Avro
- An alternate join implementation for skewed datasets
- Resumable Kafka Stream, which reads ZooKeeper offset data instead of checkpoints, because checkpoints do not allow any changes in application code
- Writing DataFrames into an external Hive table or partition

One additional quirk is the "stop" tab that CueSheet adds to the Spark UI. It features three buttons with an increasing degree of seriousness. To stop a Spark Streaming application, possibly so that a scheduler like Marathon restarts it, either of the two buttons on the left will do the job. If you need to halt a Spark application ASAP, the red button will immediately kill the Spark driver.

License

This software is licensed under the Apache 2 license, quoted below.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at http://apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
