TensorFlow on YARN (TonY) - A framework for running TensorFlow natively on Apache Hadoop

User submission 698 2022-10-29

TonY is a framework to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow, PyTorch, MXNet and Horovod. TonY enables running either single node or distributed training as a Hadoop application. This native connector, together with other TonY features, aims to run machine learning jobs reliably and flexibly. For a quick overview of TonY and comparisons to other frameworks, please see this presentation.

Compatibility Notes

It is recommended to run TonY with Hadoop 3.1.1 and above. TonY itself is compatible with Hadoop 2.7.4 and above. If you need GPU isolation from TonY, you need Hadoop 3.1.0 or higher.

Build

How to build

TonY is built using Gradle. To build TonY, run:

./gradlew build

This automatically runs the tests. To build without running tests, run:

./gradlew build -x test

The jar required to run TonY will be located in ./tony-cli/build/libs/.

Publishing (for admins)

Follow this guide to generate a key pair using GPG. Publish your public key.

Create a Nexus account at https://oss.sonatype.org/ and request access to publish to com.linkedin.tony. Here's an example Jira ticket: https://issues.sonatype.org/browse/OSSRH-47350.

Configure your ~/.gradle/gradle.properties file:

# signing plugin uses these
signing.keyId=...
signing.secretKeyRingFile=/home//.gnupg/secring.gpg
signing.password=...
# maven repo credentials
mavenUser=...
mavenPassword=...
# gradle-nexus-staging-plugin uses these
nexusUsername=
nexusPassword=

Now you can publish and release artifacts by running ./gradlew publish closeAndReleaseRepository.

Usage

TonY is a Java library, so it is as simple as running a Java program. There are two ways to launch your deep learning jobs with TonY:

- Use a Docker container.
- Use a zipped Python virtual environment.

Use a Docker container

Note that this requires a properly configured Hadoop cluster with Docker support. Check this documentation if you are unsure how to set it up. Assuming your Hadoop cluster is set up with the Docker container runtime, you should already have built a Docker image with the required Hadoop configurations. The next step is to install your Python dependencies, such as TensorFlow or PyTorch, inside that Docker image.

Below is a folder structure of what you need to launch the job:

MyJob/
  > src/
    > models/
      mnist_distributed.py
  tony.xml
  tony-cli-0.1.5-all.jar

The src/ folder contains all your training scripts. The tony.xml file configures your training job. When using Docker as the container runtime, your configuration should look similar to the following:

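$ cat MyJob/tony.xml
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
  <property>
    <name>tony.docker.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>tony.docker.containers.image</name>
    <value>YOUR_DOCKER_IMAGE_NAME</value>
  </property>
</configuration>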
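If you keep job settings in a script, a file in this layout can be generated rather than written by hand. The following is a minimal sketch that assumes tony.xml uses the usual Hadoop-style configuration layout of property/name/value entries; the helper name build_tony_xml is hypothetical and not part of TonY.

```python
import xml.etree.ElementTree as ET


def build_tony_xml(settings):
    """Render a dict of settings as a Hadoop-style <configuration> document.

    Hypothetical helper for illustration; it is not part of TonY.
    """
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # ET.indent exists from Python 3.9 on; older versions emit one long line.
    if hasattr(ET, "indent"):
        ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


# The same keys and values as the Docker example above.
docker_job = {
    "tony.worker.instances": "4",
    "tony.worker.memory": "4g",
    "tony.worker.gpus": "1",
    "tony.ps.memory": "3g",
    "tony.docker.enabled": "true",
    "tony.docker.containers.image": "YOUR_DOCKER_IMAGE_NAME",
}

xml_text = build_tony_xml(docker_job)
print(xml_text)
```

Writing xml_text to MyJob/tony.xml would give a file equivalent to the one shown above.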

For a full list of configurations, please see the wiki.

Now you're ready to launch your job:

$ java -cp "`hadoop classpath --glob`:MyJob/*:MyJob/" \
    com.linkedin.tony.cli.ClusterSubmitter \
    -executes models/mnist_distributed.py \
    -task_params '--input_dir /path/to/hdfs/input --output_dir /path/to/hdfs/output' \
    -src_dir src \
    -python_binary_path /home/user_name/python_virtual_env/bin/python
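When jobs are submitted from a script, it can be easier to assemble this command as an argument list than as one quoted shell string. The sketch below only builds the list and uses the flags shown in this document. The function name build_submit_command and the classpath placeholder are assumptions for illustration; in practice the classpath would be the output of `hadoop classpath --glob`.

```python
import shlex


def build_submit_command(hadoop_classpath, job_dir, executes, src_dir,
                         task_params=None, python_venv=None,
                         python_binary_path=None):
    """Assemble the java command line for ClusterSubmitter as an argv list.

    Hypothetical helper: it only mirrors the flags shown in this document.
    """
    # The job directory is on the classpath twice: once for the jar (job_dir/*)
    # and once so that tony.xml in job_dir is found.
    classpath = f"{hadoop_classpath}:{job_dir}/*:{job_dir}/"
    argv = [
        "java", "-cp", classpath,
        "com.linkedin.tony.cli.ClusterSubmitter",
        "-executes", executes,
        "-src_dir", src_dir,
    ]
    if task_params:
        argv += ["-task_params", task_params]
    if python_venv:
        argv += ["-python_venv", python_venv]
    if python_binary_path:
        argv += ["-python_binary_path", python_binary_path]
    return argv


argv = build_submit_command(
    hadoop_classpath="HADOOP_CLASSPATH_PLACEHOLDER",  # stand-in value
    job_dir="MyJob",
    executes="models/mnist_distributed.py",
    src_dir="src",
    task_params="--input_dir /path/to/hdfs/input --output_dir /path/to/hdfs/output",
    python_binary_path="/home/user_name/python_virtual_env/bin/python",
)
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) avoids shell quoting problems with the -task_params value, because each element is handed to java as a single argument.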

Use a zipped Python virtual environment

The differences between this approach and the one with Docker are:

- You don't need to set up your Hadoop cluster with Docker support.
- There is no requirement on a Docker image registry.

As you know, nothing comes for free. If you skip setting up Docker support on your cluster, you need to prepare a zipped virtual environment for your job, and your cluster must run the same OS version as the machine that builds the Python virtual environment.

Python virtual environment in a zip

$ unzip -Z1 my-venv.zip | head -n 10
Python/
Python/bin/
Python/bin/rst2xml.py
Python/bin/wheel
Python/bin/rst2html5.py
Python/bin/rst2odt.py
Python/bin/rst2s5.py
Python/bin/pip2.7
Python/bin/saved_model_cli
Python/bin/rst2pseudoxml.pyc
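Before submitting, it can help to confirm that the archive really contains the interpreter at the relative path you plan to pass as python_binary_path. The following is a small sketch using Python's standard zipfile module. The helper name has_python_binary is hypothetical, and the demo builds a tiny in-memory archive instead of reading a real my-venv.zip.

```python
import io
import zipfile


def has_python_binary(zip_file, python_binary_path="Python/bin/python"):
    """Return True if the archive has an entry at the given relative path."""
    with zipfile.ZipFile(zip_file) as archive:
        return python_binary_path in archive.namelist()


# Build a tiny stand-in archive in memory; a real check would pass the
# path to my-venv.zip instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Python/bin/python", b"")
    archive.writestr("Python/bin/pip2.7", b"")

buffer.seek(0)
found = has_python_binary(buffer)
buffer.seek(0)
missing = has_python_binary(buffer, "venv/bin/python")
print(found, missing)
```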

TonY jar and tony.xml

MyJob/
  > src/
    > models/
      mnist_distributed.py
  tony.xml
  tony-cli-0.1.5-all.jar
  my-venv.zip # The additional file you need.

A similar tony.xml but without Docker related configurations:

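$ cat tony/tony.xml
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
</configuration>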

Then you can launch your job:

# -executes is the path to the model program, relative to -src_dir.
# -python_binary_path is the path to the Python binary, relative to the root of my-venv.zip.
$ java -cp "`hadoop classpath --glob`:MyJob/*:MyJob" \
    com.linkedin.tony.cli.ClusterSubmitter \
    -executes models/mnist_distributed.py \
    -task_params '--input_dir /path/to/hdfs/input --output_dir /path/to/hdfs/output' \
    -python_venv my-venv.zip \
    -python_binary_path Python/bin/python \
    -src_dir src

TonY arguments

The command line arguments are as follows:

Name	Required?	Example	Meaning
executes	yes	--executes model/mnist.py	Location of the entry point of your training code.
src_dir	yes	--src_dir src/	Name of the local root directory that contains all of your Python model source code. This directory is copied to all worker nodes.
task_params	no	--input_dir /hdfs/input --output_dir /hdfs/output	The command line arguments that are passed to your entry point.
python_venv	no	--python_venv venv.zip	Path to the zipped local Python virtual environment
python_binary_path	no	--python_binary_path Python/bin/python	Used together with python_venv, describes the relative path in your python virtual environment which contains the python binary, or an absolute path to use a python binary already installed on all worker nodes
shell_env	no	--shell_env LD_LIBRARY_PATH=/usr/local/lib64/	Specifies key-value pairs for environment variables which will be set in your python worker/ps processes.
conf_file	no	--conf_file tony-local.xml	Location of a TonY configuration file.
conf	no	--conf tony.application.security.enabled=false	Override configurations from your configuration file via command line
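On the training-script side, the task_params value arrives as ordinary command line arguments to the entry point. The sketch below shows one way an entry point such as mnist_distributed.py might read them, assuming the script uses argparse. The flag names come from the examples in this document; the function name parse_task_params is hypothetical.

```python
import argparse
import shlex


def parse_task_params(argv=None):
    """Parse the arguments that TonY forwards from -task_params."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--input_dir", required=True,
                        help="HDFS directory holding the training data")
    parser.add_argument("--output_dir", required=True,
                        help="HDFS directory for checkpoints and results")
    # argv=None makes argparse fall back to sys.argv[1:], as in a real run.
    return parser.parse_args(argv)


# Simulate the string given to -task_params on the TonY command line.
task_params = "--input_dir /hdfs/input --output_dir /hdfs/output"
args = parse_task_params(shlex.split(task_params))
print(args.input_dir, args.output_dir)
```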

TonY configurations

There are multiple ways to specify configurations for your TonY job. As above, you can create an XML file called tony.xml and add its parent directory to your java classpath.

Alternatively, you can pass -conf_file to the java command line if you have a file not named tony.xml containing your configurations. (As before, the parent directory of this file must be added to the java classpath.)

If you wish to override configurations from your configuration file via the command line, pass -conf key=value argument pairs, for example -conf tony.application.security.enabled=false.
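The precedence described above, where command line pairs win over values from the configuration file, can be pictured with a short sketch. This illustrates the layering only and is not TonY's actual implementation; the function name apply_overrides is hypothetical.

```python
def apply_overrides(file_settings, conf_args):
    """Layer key=value overrides on top of settings read from a file.

    Illustration of the precedence only; not TonY's own code.
    """
    merged = dict(file_settings)  # copy so the file settings stay untouched
    for pair in conf_args:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        merged[key] = value
    return merged


# Values as they might be read from tony.xml.
from_file = {"tony.worker.instances": "4", "tony.worker.memory": "4g"}
# Values as they might be given with repeated -conf arguments.
from_cli = ["tony.worker.instances=8", "tony.application.security.enabled=false"]

effective = apply_overrides(from_file, from_cli)
print(effective)
```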

Please check our wiki for all TonY configurations and their default values.

TonY Examples

Below are examples to run distributed deep learning jobs with TonY:

- Distributed MNIST with TensorFlow
- Distributed MNIST with PyTorch
- Linear regression with MXNet
- TonY in Google Cloud Platform
- TonY in Azkaban video

More information

For more information about TonY, check out the following:

- TonY presentation at DataWorks Summit '19 in Washington, D.C.
- TonY OpML '19 paper
- TonY LinkedIn Engineering blog post

FAQ

My TensorFlow process hangs with the following output. Why?

2018-09-13 03:02:31.538790: E tensorflow/core/distributed_runtime/master.cc:272] CreateSession failed because worker /job:worker/replica:0/task:0 returned error: Unavailable: OS Error
INFO:tensorflow:An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error
INFO:tensorflow:Graph was finalized.
2018-09-13 03:03:33.792490: I tensorflow/core/distributed_runtime/master_session.cc:1150] Start master session ea811198d338cc1d with config:
INFO:tensorflow:Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: conv1/Variable, conv1/Variable_1, conv2/Variable, conv2/Variable_1, fc1/Variable, fc1/Variable_1, fc2/Variable, fc2/Variable_1, global_step, adam_optimizer/beta1_power, adam_optimizer/beta2_power, conv1/Variable/Adam, conv1/Variable/Adam_1, conv1/Variable_1/Adam, conv1/Variable_1/Adam_1, conv2/Variable/Adam, conv2/Variable/Adam_1, conv2/Variable_1/Adam, conv2/Variable_1/Adam_1, fc1/Variable/Adam, fc1/Variable/Adam_1, fc1/Variable_1/Adam, fc1/Variable_1/Adam_1, fc2/Variable/Adam, fc2/Variable/Adam_1, fc2/Variable_1/Adam, fc2/Variable_1/Adam_1, ready: None

Try adding the path to your libjvm.so shared library to the LD_LIBRARY_PATH environment variable for your workers. See the shell_env row in the arguments table above for an example.

How do I configure arbitrary TensorFlow job types?

Please see the wiki on TensorFlow task configuration for details.
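For the libjvm.so answer above, the directory to add depends on the Java installation on your worker nodes. The sketch below searches a directory tree for libjvm.so and formats the result as a value for shell_env. The helper names are hypothetical, and the demo runs against a throwaway directory, not a real JDK. On a real cluster you would point it at the Java home used on the workers, which is an assumption about your environment.

```python
import os
import tempfile


def find_libjvm_dirs(java_home):
    """Return the sorted directories under java_home that contain libjvm.so."""
    matches = set()
    for root, _dirs, files in os.walk(java_home):
        if "libjvm.so" in files:
            matches.add(root)
    return sorted(matches)


def shell_env_argument(lib_dirs):
    """Format library directories as a KEY=VALUE string for -shell_env."""
    return "LD_LIBRARY_PATH=" + os.pathsep.join(lib_dirs)


# Demonstration against a throwaway directory tree, not a real JDK.
with tempfile.TemporaryDirectory() as fake_java_home:
    server_dir = os.path.join(fake_java_home, "lib", "server")
    os.makedirs(server_dir)
    open(os.path.join(server_dir, "libjvm.so"), "w").close()
    found_dirs = find_libjvm_dirs(fake_java_home)
    env_argument = shell_env_argument(found_dirs)
    print(env_argument)
```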
