XLearning: a scheduling system that supports multiple machine learning and deep learning frameworks


XLearning is a convenient and efficient scheduling platform that combines big data and artificial intelligence and supports a variety of machine learning and deep learning frameworks. XLearning runs on Hadoop YARN and has integrated deep learning frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost. XLearning offers good scalability and compatibility.

Chinese documentation

Architecture

Client: starts the application and gets its state.
ApplicationMaster (AM): the internal scheduler and lifecycle manager, responsible for input data distribution and container management.
Container: the actual executor of the application; it starts the Worker or PS (Parameter Server) process, monitors the process and reports its status to the AM, and saves the output. For TensorFlow applications it can also start the TensorBoard service.
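Because XLearning runs on Hadoop YARN, a submitted application appears as an ordinary YARN application. As a general usage note (these are standard YARN client commands, not XLearning-specific ones), the state the Client reports can also be cross-checked from the command line:

# Standard Hadoop YARN client commands (not XLearning-specific):
# list running applications, then query one by its application ID
yarn application -list
yarn application -status <Application ID>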

Functions

1 Support Multiple Deep Learning Frameworks

Besides the distributed mode of the TensorFlow and MXNet frameworks, XLearning supports the standalone mode of all deep learning frameworks such as Caffe, Theano, and PyTorch. Moreover, XLearning flexibly allows custom and multiple versions of each framework.

2 Unified Data Management Based On HDFS

The input strategy for the input data (--input) can be specified by setting the --input-strategy parameter or the xlearning.input.strategy configuration. XLearning supports three ways to read HDFS input data (an example submission is shown after the list):

Download: The AM traverses all files under the specified HDFS path and distributes the data to workers by file. Each worker downloads its files from the remote path to the local machine.
Placeholder: Unlike Download mode, the AM only sends the related HDFS file list to the workers. The process in each worker reads the data from HDFS directly.
InputFormat: Integrating the InputFormat function of MapReduce, XLearning allows the user to specify any implementation of InputFormat for the input data. The AM splits the input data and assigns the fragments to different workers. Each worker passes its assigned fragments through a pipeline to the execution process.
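For example, selecting the Placeholder strategy at submission time might look like the following sketch. The value "PLACEHOLDER" is an assumption that mirrors the mode name, so check the Configuration part for the exact accepted values; in this mode the user program is expected to read the HDFS paths itself rather than local copies.

# Hedged sketch: the strategy value "PLACEHOLDER" is assumed to mirror the mode name above
$XLEARNING_HOME/bin/xl-submit \
  --app-type "tensorflow" \
  --app-name "tf-placeholder-demo" \
  --input /tmp/data/tensorflow#data \
  --input-strategy "PLACEHOLDER" \
  --files demo.py,dataDeal.py \
  --launch-cmd "python demo.py" \
  --worker-memory 4G \
  --worker-num 2 \
  --queue default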

Similar to the read strategy, the output strategy for the output data (--output) can be specified by setting the --output-strategy parameter or the xlearning.output.strategy configuration. There are two result output modes:

Upload: After the program finishes, each worker uploads its local output directory directly to the specified HDFS path. The "Saved Model" button on the web interface allows the user to upload intermediate results to the remote path during execution.
OutputFormat: Integrating the OutputFormat function of MapReduce, XLearning allows the user to specify any implementation of OutputFormat for saving the result to HDFS.
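The output strategy can be chosen the same way. In the sketch below, the value "UPLOAD" is again an assumption that mirrors the mode name; the configuration-file alternative is the xlearning.output.strategy property mentioned above.

# Hedged sketch: each worker copies its local ./model directory to /tmp/tensorflow_model when the job finishes
$XLEARNING_HOME/bin/xl-submit \
  --app-type "tensorflow" \
  --app-name "tf-upload-demo" \
  --output /tmp/tensorflow_model#model \
  --output-strategy "UPLOAD" \
  --files demo.py,dataDeal.py \
  --launch-cmd "python demo.py --save_path=./model" \
  --worker-memory 4G \
  --worker-num 2 \
  --queue default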

3 Visualization Display

The application interface can be divided into four parts:

All Containers: displays the container list and corresponding information, including the container host, container role, current container state, start time, finish time, and current progress.
View TensorBoard: if the TensorBoard service is enabled for a TensorFlow application, provides a link to TensorBoard for real-time viewing.
Save Model: if the application has output, the user can upload the intermediate output to the specified HDFS path during execution through the "Save Model" button. After the upload finishes, the list of saved intermediate paths is displayed.
Worker Metrics: displays the resource usage metrics of each worker.

4 Compatible With The Code At Native Frameworks

Apart from the automatic construction of the ClusterSpec for the distributed mode of the TensorFlow framework, programs written for standalone-mode TensorFlow and other deep learning frameworks can be executed on XLearning directly, without modification.
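As an illustration, a single-machine script can be submitted as-is. The sketch below is hypothetical: the script name train.py and the resource figures are placeholders, and how XLearning distinguishes standalone from distributed mode (for example, by the absence of ps parameters) should be checked in the Submit Parameter documentation.

# Hypothetical standalone-mode submission: train.py is an ordinary single-machine script, used unchanged
$XLEARNING_HOME/bin/xl-submit \
  --app-type "tensorflow" \
  --app-name "tf-standalone-demo" \
  --files train.py \
  --launch-cmd "python train.py" \
  --worker-memory 4G \
  --worker-num 1 \
  --worker-cores 2 \
  --queue default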

Compilation & Deployment Instructions

1 Compilation Environment Requirements

JDK >= 1.7
Maven >= 3.3

2 Compilation Method

Run the following command in the root directory of the source code:

mvn package

After compiling, a distribution package named xlearning-1.1-dist.tar.gz is generated under target in the root directory. Unpacking the distribution package produces the following subdirectories under its root directory (a typical unpack command is shown after the list):

bin: scripts for application submission
lib: JARs for XLearning and its dependencies
conf: configuration files
sbin: scripts for the history service
data: data and files for the examples
examples: XLearning examples
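For reference, a typical unpack sequence run from the source root (standard tar usage; the package name is the one produced by the build above):

# the distribution package is produced under target/ by mvn package
cd target
tar -zxvf xlearning-1.1-dist.tar.gz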

3 Deployment Environment Requirements

CentOS 7.2
Java >= 1.7
Hadoop = 2.7 (Hadoop GPU version)
[optional] Dependent environment for the deep learning frameworks at the cluster nodes, such as TensorFlow, numpy, Caffe.

4 XLearning Client Deployment Guide

Under the "conf" directory of the unpacking distribution package "$XLEARNING_HOME", configure the related files:

xlearning-env.sh: set the environment variables, such as JAVA_HOME and HADOOP_CONF_DIR.
xlearning-site.xml: configure the related properties. Note that the properties associated with the history service need to be consistent with those configured when the history service was started. For more details, please see the Configuration part.
log4j.properties: configure the log level.
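A minimal xlearning-env.sh sketch; the paths below are placeholders for the local installation:

# xlearning-env.sh (paths are illustrative placeholders)
export JAVA_HOME=/usr/java/default
export HADOOP_CONF_DIR=/etc/hadoop/conf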

5 Start Method of XLearning History Service [Optional]

Run $XLEARNING_HOME/sbin/start-history-server.sh.

Quick Start

Use $XLEARNING_HOME/bin/xl-submit to submit an application to the cluster from the XLearning client. Here is a submission example for a TensorFlow application.

1 Upload Data To HDFS

upload the "data" directory under the root of unpacking distribution package to HDFS

cd $XLEARNING_HOME
hadoop fs -put data /tmp/

2 Submit

cd $XLEARNING_HOME/examples/tensorflow
$XLEARNING_HOME/bin/xl-submit \
  --app-type "tensorflow" \
  --app-name "tf-demo" \
  --input /tmp/data/tensorflow#data \
  --output /tmp/tensorflow_model#model \
  --files demo.py,dataDeal.py \
  --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" \
  --worker-memory 10G \
  --worker-num 2 \
  --worker-cores 1 \
  --worker-gpus 1 \
  --ps-memory 1G \
  --ps-num 1 \
  --ps-cores 2 \
  --queue default

The meanings of the parameters are as follows:

Property Name | Meaning
app-name | application name, "tf-demo"
app-type | application type, "tensorflow"
input | input file; the HDFS path "/tmp/data/tensorflow" corresponds to the local dir "./data"
output | output file; the HDFS path "/tmp/tensorflow_model" corresponds to the local dir "./model"
files | application program and required local files, including demo.py and dataDeal.py
launch-cmd | the execution command
worker-memory | amount of memory to use for each worker process, 10 GB
worker-num | number of worker containers to use for the application, 2
worker-cores | number of cores to use for each worker process, 1
worker-gpus | number of GPUs to use for each worker process, 1
ps-memory | amount of memory to use for each PS process, 1 GB
ps-num | number of PS containers to use for the application, 1
ps-cores | number of cores to use for each PS process, 2
queue | the queue that the application is submitted to

For more details, see the Submit Parameter part.

FAQ

XLearning FAQ

Contact us
