MLComp - A distributed DAG (directed acyclic graph) framework for machine learning

Reader submission, 2022-11-02

The goal of MLComp is to provide tools for training, inference, and building complex pipelines (especially for computer vision) in a rapid, manageable way. MLComp is compatible with Python 3.6+ and Unix operating systems.

MLComp is part of the Catalyst Ecosystem; see the project manifest.

Features

- Amazing UI
- Catalyst support
- Distributed training
- Supervisor that controls computational resources
- Synchronization of both code and data
- Resource monitoring
- Full functionality of the pause and continue on UI
- Auto control of the requirements
- Code dumping (with syntax highlight on UI)
- Kaggle integration
- Hierarchical logging
- Grid search
- Experiments comparison
- Customizing layout system

Contents

- Screenshots
- Installation
- UI
- Usage
- Docs and examples
- Environment variables

Screenshots

Screenshots in the original README (images not reproduced here): Dags, Computers, Reports, Code, Graph.

Installation

1. Install the MLComp package:

    sudo apt-get install -y \
        libavformat-dev libavcodec-dev libavdevice-dev \
        libavutil-dev libswscale-dev libavresample-dev libavfilter-dev

    pip install mlcomp
    mlcomp init
    mlcomp migrate

2. Set up your environment. See the Environment variables section.

3. Run the database, redis, mlcomp-server and mlcomp-workers:

Variant 1: minimal (if you have 1 computer)

Run everything necessary (mlcomp-server, mlcomp-workers, redis-server). This variant uses SQLITE:

    mlcomp-server start --daemon=True

Variant 2: full

a. Change your Environment variables to use PostgreSql.

b. Install rsync on each work computer:

    sudo apt-get install rsync

Ensure that every computer is reachable over SSH at the IP/PORT you specified in the Environment variables file.

rsync will perform the following commands.

To upload:

    rsync -vhru -e "ssh -p {target.port} -o StrictHostKeyChecking=no" \
        {folder}/ {target.user}@{target.ip}:{folder}/ --perms --chmod=777

To download:

    rsync -vhru -e "ssh -p {source.port} -o StrictHostKeyChecking=no" \
        {source.user}@{source.ip}:{folder}/ {folder}/ --perms --chmod=777

c. Install apex for distributed learning.

d. To run postgresql, redis-server and mlcomp-server, execute on your server computer:

    cd ~/mlcomp/configs/
    docker-compose -f server-compose.yml up -d

e. Run on each worker computer:

    mlcomp-worker start
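The rsync templates above can be filled in programmatically, which is handy for checking SSH connectivity between computers before starting the workers. The following is a minimal sketch, not MLComp's own implementation: the `Computer` class and the function names are hypothetical, and only the rsync flags are taken from the commands shown above.

```python
import shlex
from dataclasses import dataclass


@dataclass
class Computer:
    """Hypothetical holder for the user/IP/PORT of a work computer."""
    user: str
    ip: str
    port: int


def _ssh_option(computer):
    # Same SSH transport string as in the rsync templates above.
    return f"ssh -p {computer.port} -o StrictHostKeyChecking=no"


def upload_command(folder, target):
    """Build the argument list that copies a local folder to the target."""
    folder = folder.rstrip("/")
    return [
        "rsync", "-vhru", "-e", _ssh_option(target),
        f"{folder}/", f"{target.user}@{target.ip}:{folder}/",
        "--perms", "--chmod=777",
    ]


def download_command(folder, source):
    """Build the argument list that copies a folder from the source."""
    folder = folder.rstrip("/")
    return [
        "rsync", "-vhru", "-e", _ssh_option(source),
        f"{source.user}@{source.ip}:{folder}/", f"{folder}/",
        "--perms", "--chmod=777",
    ]


def as_shell(command):
    """Render an argument list as a single shell-safe string."""
    return " ".join(shlex.quote(part) for part in command)


if __name__ == "__main__":
    worker = Computer(user="alice", ip="10.0.0.2", port=22)  # example values
    print(as_shell(upload_command("/home/alice/mlcomp/data", worker)))
```

Keeping the command as an argument list means it can be passed to `subprocess.run` without shell quoting problems; `as_shell` is only for display.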

UI

The web site is available at http://{WEB_HOST}:{WEB_PORT}

By default, it is http://localhost:4201
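As a small illustration of how the address is composed, the sketch below builds the URL from the WEB_HOST and WEB_PORT variables. It is not part of MLComp: the function name `web_url` is hypothetical, the fallback values are taken from the default URL above, and mapping 0.0.0.0 to localhost is an assumption that only makes sense when browsing from the server itself.

```python
import os


def web_url(env=None):
    """Return the UI address for the given environment mapping."""
    env = os.environ if env is None else env
    host = env.get("WEB_HOST", "localhost")
    port = env.get("WEB_PORT", "4201")
    # 0.0.0.0 is a bind address, not something a browser can open
    # (assumption: we are on the same machine as the server).
    if host == "0.0.0.0":
        host = "localhost"
    return f"http://{host}:{port}"


if __name__ == "__main__":
    print(web_url())
```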

The front end is built with AngularJS.

If you want to change it, see the front end's Readme page.

Usage

Run

mlcomp dag PATH_TO_CONFIG.yml

This command copies the files in the directory containing the config to the database.

The server then schedules the DAG, taking free resources into account.

For more information, see the Docs.
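To make the idea of scheduling a DAG against free resources concrete, here is a toy sketch. It is not MLComp's scheduler: the function `schedule`, the task names and the single "free GPUs" counter are all assumptions made for illustration. A task becomes runnable once all of its prerequisites have finished, and it is started only if enough GPUs remain in the current round.

```python
def schedule(tasks, depends_on, free_gpus):
    """Group tasks into rounds that respect dependencies and a GPU budget.

    tasks: dict mapping task name -> number of GPUs it needs
    depends_on: dict mapping task name -> set of prerequisite task names
    free_gpus: GPUs available in every round (toy model of one computer)
    """
    done = set()
    remaining = dict(tasks)
    rounds = []
    while remaining:
        available = free_gpus
        batch = []
        for name in sorted(remaining):
            ready = depends_on.get(name, set()) <= done
            if ready and remaining[name] <= available:
                batch.append(name)
                available -= remaining[name]
        if not batch:
            # Either the graph has a cycle or a task needs more GPUs than exist.
            raise ValueError("no runnable task: cycle or insufficient resources")
        for name in batch:
            del remaining[name]
        done.update(batch)
        rounds.append(batch)
    return rounds


if __name__ == "__main__":
    tasks = {"preprocess": 0, "train_a": 1, "train_b": 1, "report": 0}
    depends_on = {
        "train_a": {"preprocess"},
        "train_b": {"preprocess"},
        "report": {"train_a", "train_b"},
    }
    print(schedule(tasks, depends_on, free_gpus=1))
    print(schedule(tasks, depends_on, free_gpus=2))
```

With one free GPU the two training tasks run one after the other; with two they share a round. A real scheduler would also track several computers, memory and CPU, which this sketch leaves out.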

Docs and examples

You can find advanced tutorials and MLComp best practices in the examples folder of the repository.

The FileSync tutorial describes the data synchronization mechanism.

Environment variables

The single file used to set up your computer environment is located at ~/mlcomp/configs/.env

- ROOT_FOLDER. Folder where MLComp saves its files: configs, db, tasks, etc.
- TOKEN. Site security token. Please change it to any string.
- DB_TYPE. Either SQLITE or POSTGRESQL.
- POSTGRES_DB. PostgreSql db name.
- POSTGRES_USER. PostgreSql user.
- POSTGRES_PASSWORD. PostgreSql password.
- POSTGRES_HOST. PostgreSql host.
- PGDATA. PostgreSql db files location.
- REDIS_HOST. Redis host.
- REDIS_PORT. Redis port.
- REDIS_PASSWORD. Redis password.
- WEB_HOST. MLComp site host. 0.0.0.0 means it is available from everywhere.
- WEB_PORT. MLComp site port.
- CONSOLE_LOG_LEVEL. Log level for output to the console.
- DB_LOG_LEVEL. Log level for output to the database.
- IP. IP of a work computer. The work computer must be accessible from other work computers at this IP/PORT.
- PORT. Port of a work computer. The work computer must be accessible from other work computers at this IP/PORT (SSH protocol).
- MASTER_PORT_RANGE. Distributed port range for a work computer. 29500-29510 means that if this work computer is the master in distributed learning, it will use the first free port from this range. Ranges of different work computers must not overlap.
- NCCL_SOCKET_IFNAME. NCCL network interface.
- FILE_SYNC_INTERVAL. File sync interval in seconds. 0 means file sync is off.
- WORKER_USAGE_INTERVAL. Interval in seconds for writing worker usage to the DB.
- INSTALL_DEPENDENCIES. True/False. Whether to install dependent libraries.
- SYNC_WITH_THIS_COMPUTER. True/False. If False, the other computers will not sync with this one.
- CAN_PROCESS_TASKS. True/False. If False, this computer does not process tasks.
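The MASTER_PORT_RANGE rule (first free port, no overlap between computers) is easy to get wrong when editing several .env files by hand. The sketch below shows one way to check it. It is not MLComp code: all function names are hypothetical, the .env file is assumed to consist of plain KEY=VALUE lines, and the range is assumed to include both endpoints.

```python
import socket


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments (assumed format)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def parse_port_range(value):
    """Turn '29500-29510' into (29500, 29510); endpoints assumed inclusive."""
    start, end = (int(part) for part in value.split("-"))
    if start > end:
        raise ValueError(f"invalid port range: {value}")
    return start, end


def ranges_overlap(a, b):
    """True if two inclusive (start, end) ranges share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def first_free_port(port_range, host="127.0.0.1"):
    """Return the first port in the range that can be bound, or None."""
    start, end = port_range
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is in use or not permitted, try the next one
            return port
    return None


if __name__ == "__main__":
    env = parse_env("DB_TYPE=SQLITE\nMASTER_PORT_RANGE=29500-29510\n")
    this_range = parse_port_range(env["MASTER_PORT_RANGE"])
    other_range = parse_port_range("29511-29520")  # another computer's range
    print(ranges_overlap(this_range, other_range))
    print(first_free_port(this_range))
```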

You can list your network interfaces with the ifconfig command. For NCCL_SOCKET_IFNAME, see the NVIDIA documentation.
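If ifconfig is not installed, the interface names can also be listed from Python with the standard library. This is only a convenience sketch; the function name `interface_names` is hypothetical, and choosing which name to put into NCCL_SOCKET_IFNAME is still up to you.

```python
import socket


def interface_names():
    """Return the names of the network interfaces known to the OS."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    for name in interface_names():
        print(name)
```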
