Softlearning: a reinforcement learning framework for training maximum entropy policies in continuous domains

User submission, 2022-11-05

Softlearning

Softlearning is a deep reinforcement learning toolbox for training maximum entropy policies in continuous domains. The implementation is fairly thin and primarily optimized for our own development purposes. It uses the tf.keras modules for most of the model classes (e.g. policies and value functions). We use Ray for experiment orchestration. Ray Tune and Autoscaler implement several neat features that let us take the same experiment scripts we use for local prototyping and launch large-scale experiments on any chosen cloud service (e.g. GCP or AWS), intelligently parallelizing and distributing training for effective resource allocation.

This implementation uses TensorFlow. For a PyTorch implementation of soft actor-critic, take a look at rlkit.

Getting Started

Prerequisites

The environment can be run either locally using conda or inside a docker container. For conda installation, you need to have Conda installed. For docker installation you will need to have Docker and Docker Compose installed. Also, most of our environments currently require a MuJoCo license.
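Before picking an installation route, it can help to check which of the required tools are already on your PATH. The following is a minimal sketch in POSIX shell; `missing_tools` is a hypothetical helper written for this illustration and is not part of softlearning.

```shell
#!/bin/sh
# Hypothetical helper (not part of softlearning): print every command from
# the argument list that cannot be found on PATH, one name per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The conda route needs conda; the docker route needs docker and docker-compose.
missing="$(missing_tools conda docker docker-compose)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "conda, docker and docker-compose are all available"
fi
```

Only the tools for the route you choose need to be present, so a partial list of missing names is not necessarily a problem.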

Conda Installation

Download and install MuJoCo 1.50 and 2.00 from the MuJoCo website. We assume that the MuJoCo files are extracted to the default location (~/.mujoco/mjpro150 and ~/.mujoco/mujoco200_{platform}). Unfortunately, gym and dm_control expect different paths for the MuJoCo 2.00 installation, which is why you will need to have it available both in ~/.mujoco/mujoco200_{platform} and ~/.mujoco/mujoco200. The easiest way is to create a symlink at ~/.mujoco/mujoco200 that points to ~/.mujoco/mujoco200_{platform} with:

ln -s ~/.mujoco/mujoco200_{platform} ~/.mujoco/mujoco200

Copy your MuJoCo license key (mjkey.txt) to ~/.mujoco/mjkey.txt.

Clone softlearning:

git clone https://github.com/rail-berkeley/softlearning.git ${SOFTLEARNING_PATH}

Create and activate the conda environment, then install softlearning to enable the command line interface:

cd ${SOFTLEARNING_PATH}
conda env create -f environment.yml
conda activate softlearning
pip install -e ${SOFTLEARNING_PATH}

The environment should now be ready to run. See the Examples section for examples of how to train and simulate the agents.
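If an environment fails to load, a common first step is to confirm that the MuJoCo files are where the installation steps expect them. The sketch below is one way to do that in POSIX shell. Both functions are hypothetical helpers written for this illustration, and the platform suffix (for example `linux`) is an assumption that depends on the archive you downloaded.

```shell
#!/bin/sh
# Hypothetical helper: make sure <root>/mujoco200 exists by symlinking it to
# <root>/mujoco200_<platform>, mirroring the `ln -s` step from the
# installation instructions.
link_mujoco200() {
    root="$1"        # e.g. "$HOME/.mujoco"
    platform="$2"    # assumption: suffix of the extracted archive, e.g. "linux"
    src="$root/mujoco200_$platform"
    dst="$root/mujoco200"

    if [ ! -d "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi
    # Leave an existing directory or symlink untouched.
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        return 0
    fi
    ln -s "$src" "$dst"
}

# Hypothetical helper: report which of the expected MuJoCo paths are absent.
# Returns 0 only if all of them exist.
check_mujoco_layout() {
    root="$1"
    rc=0
    for p in "$root/mjpro150" "$root/mujoco200" "$root/mjkey.txt"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example usage (uncomment and adjust the platform name):
# link_mujoco200 "$HOME/.mujoco" linux && check_mujoco_layout "$HOME/.mujoco"
```

The check only looks at paths. It does not validate the license key or the MuJoCo binaries themselves.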

Finally, to deactivate and remove the conda environment:

conda deactivate
conda remove --name softlearning --all

Docker Installation

docker-compose

To build the image and run the container:

export MJKEY="$(cat ~/.mujoco/mjkey.txt)" \
    && docker-compose \
        -f ./docker/docker-compose.dev.cpu.yml \
        up \
        -d \
        --force-recreate

You can access the container with the usual docker exec command:

docker exec -it softlearning bash

See the Examples section for examples of how to train and simulate the agents.

Finally, to clean up the docker setup:

docker-compose \
    -f ./docker/docker-compose.dev.cpu.yml \
    down \
    --rmi all \
    --volumes

Examples

Training and simulating an agent

To train the agent:

softlearning run_example_local examples.development \
    --algorithm SAC \
    --universe gym \
    --domain HalfCheetah \
    --task v3 \
    --exp-name my-sac-experiment-1 \
    --checkpoint-frequency 1000  # Save the checkpoint to resume training later

To simulate the resulting policy, first find the absolute path that the checkpoint is saved to. By default (i.e. without specifying the log-dir argument to the previous script), the data is saved under ~/ray_results/<universe>/<domain>/<task>/<timestamp>-<exp-name>/<trial-id>/<checkpoint-id>. For example: ~/ray_results/gym/HalfCheetah/v3/2018-12-12T16-48-37-my-sac-experiment-1-0/mujoco-runner_0_seed=7585_2018-12-12_16-48-37xuadh9vd/checkpoint_1000/. The next command assumes that this path is stored in the ${SAC_CHECKPOINT_DIR} environment variable.

python -m examples.development.simulate_policy \
    ${SAC_CHECKPOINT_DIR} \
    --max-path-length 1000 \
    --num-rollouts 1 \
    --render-kwargs '{"mode": "human"}'
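Finding the checkpoint path by hand gets tedious when the results directory is deeply nested. As a rough sketch, the helper below searches a results root for directories named checkpoint_<N> and prints the one with the largest N. `latest_checkpoint` is a hypothetical function written for this illustration, and it assumes the checkpoint directories follow the checkpoint_1000 naming shown in the example path.

```shell
#!/bin/sh
# Hypothetical helper: print the checkpoint_<N> directory with the largest N
# found anywhere under the given results root (e.g. ~/ray_results).
latest_checkpoint() {
    root="$1"
    find "$root" -type d -name 'checkpoint_*' 2>/dev/null |
        # Prefix each path with its checkpoint number so it can be sorted.
        awk -F'checkpoint_' '{ print $NF "\t" $0 }' |
        sort -n -k1,1 |
        tail -n 1 |
        cut -f2-
}

# Example usage (uncomment to set the variable used by simulate_policy):
# SAC_CHECKPOINT_DIR="$(latest_checkpoint "$HOME/ray_results/gym/HalfCheetah/v3")"
# export SAC_CHECKPOINT_DIR
```

If several trials live under the same root, the helper picks the highest checkpoint number across all of them, so point it at a single experiment or trial directory when that matters.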

examples.development.main contains several different environments, and more example scripts are available in the /examples folder. For more information about the agents and configurations, run the scripts with the --help flag:

python ./examples/development/main.py --help

optional arguments:
  -h, --help            show this help message and exit
  --universe {robosuite,dm_control,gym}
  --domain DOMAIN
  --task TASK
  --checkpoint-replay-pool CHECKPOINT_REPLAY_POOL
                        Whether a checkpoint should also saved the replay
                        pool. If set, takes precedence over
                        variant['run_params']['checkpoint_replay_pool'].
                        Note that the replay pool is saved (and
                        constructed) piece by piece so that each
                        experience is saved only once.
  --algorithm ALGORITHM
  --policy {gaussian}
  --exp-name EXP_NAME
  --mode MODE
  --run-eagerly RUN_EAGERLY
                        Whether to run tensorflow in eager mode.
  --confirm-remote [CONFIRM_REMOTE]
                        Whether or not to query yes/no on remote run.
  --video-save-frequency VIDEO_SAVE_FREQUENCY
                        Save frequency for videos.
  --cpus CPUS           Cpus to allocate to ray process. Passed to
                        `ray.init`.
  --gpus GPUS           Gpus to allocate to ray process. Passed to
                        `ray.init`.
  --resources RESOURCES
                        Resources to allocate to ray process. Passed to
                        `ray.init`.
  --include-webui INCLUDE_WEBUI
                        Boolean flag indicating whether to start the web
                        UI, which is a Jupyter notebook. Passed to
                        `ray.init`.
  --temp-dir TEMP_DIR   If provided, it will specify the root temporary
                        directory for the Ray process. Passed to
                        `ray.init`.
  --resources-per-trial RESOURCES_PER_TRIAL
                        Resources to allocate for each trial. Passed to
                        `tune.run`.
  --trial-cpus TRIAL_CPUS
                        CPUs to allocate for each trial. Note: this is
                        only used for Ray's internal scheduling
                        bookkeeping, and is not an actual hard limit for
                        CPUs. Passed to `tune.run`.
  --trial-gpus TRIAL_GPUS
                        GPUs to allocate for each trial. Note: this is
                        only used for Ray's internal scheduling
                        bookkeeping, and is not an actual hard limit for
                        GPUs. Passed to `tune.run`.
  --trial-extra-cpus TRIAL_EXTRA_CPUS
                        Extra CPUs to reserve in case the trials need to
                        launch additional Ray actors that use CPUs.
  --trial-extra-gpus TRIAL_EXTRA_GPUS
                        Extra GPUs to reserve in case the trials need to
                        launch additional Ray actors that use GPUs.
  --num-samples NUM_SAMPLES
                        Number of times to repeat each trial. Passed to
                        `tune.run`.
  --upload-dir UPLOAD_DIR
                        Optional URI to sync training results to (e.g.
                        s3:// or gs://). Passed to `tune.run`.
  --trial-name-template TRIAL_NAME_TEMPLATE
                        Optional string template for trial name. For
                        example:
                        '{trial.trial_id}-seed={trial.config[run_params][seed]}'
                        Passed to `tune.run`.
  --checkpoint-frequency CHECKPOINT_FREQUENCY
                        How many training iterations between checkpoints.
                        A value of 0 (default) disables checkpointing. If
                        set, takes precedence over
                        variant['run_params']['checkpoint_frequency'].
                        Passed to `tune.run`.
  --checkpoint-at-end CHECKPOINT_AT_END
                        Whether to checkpoint at the end of the
                        experiment. If set, takes precedence over
                        variant['run_params']['checkpoint_at_end'].
                        Passed to `tune.run`.
  --max-failures MAX_FAILURES
                        Try to recover a trial from its last checkpoint
                        at least this many times. Only applies if
                        checkpointing is enabled. Passed to `tune.run`.
  --restore RESTORE     Path to checkpoint. Only makes sense to set if
                        running 1 trial. Defaults to None. Passed to
                        `tune.run`.
  --with-server WITH_SERVER
                        Starts a background Tune server. Needed for using
                        the Client API. Passed to `tune.run`.
  --server-port SERVER_PORT
                        Port number for launching TuneServer. Passed to
                        `tune.run`.

Resume training from a saved checkpoint

This feature is currently broken!

To resume training from a previous checkpoint, run the original example main script with an additional --restore flag that points to the checkpoint path.
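As a sketch, resuming the earlier HalfCheetah run might look like the following. It assumes the checkpoint path is stored in ${SAC_CHECKPOINT_DIR}, as in the simulation example, and that --restore accepts that path (the help text describes it only as "Path to checkpoint"). Because the feature is reported as broken, the snippet only prints the command for review rather than running it.

```shell
#!/bin/sh
# Assumption: SAC_CHECKPOINT_DIR holds the checkpoint path, as in the
# simulation example. The fallback below is only a placeholder.
SAC_CHECKPOINT_DIR="${SAC_CHECKPOINT_DIR:-/path/to/checkpoint_1000}"

# Same arguments as the original training command, plus --restore.
set -- run_example_local examples.development \
    --algorithm SAC \
    --universe gym \
    --domain HalfCheetah \
    --task v3 \
    --exp-name my-sac-experiment-1 \
    --checkpoint-frequency 1000 \
    --restore "${SAC_CHECKPOINT_DIR}"

# Print the command for review. To actually launch it, run: softlearning "$@"
resume_cmd="softlearning $*"
echo "$resume_cmd"
```

The help text also notes that --restore only makes sense when running a single trial.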

References

The algorithms are based on the following papers:

Soft Actor-Critic Algorithms and Applications. Tuomas Haarnoja*, Aurick Zhou*, Kristian Hartikainen*, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. arXiv preprint, 2018.

Latent Space Policies for Hierarchical Reinforcement Learning. Tuomas Haarnoja*, Kristian Hartikainen*, Pieter Abbeel, and Sergey Levine. International Conference on Machine Learning (ICML), 2018.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. International Conference on Machine Learning (ICML), 2018.

Composable Deep Reinforcement Learning for Robotic Manipulation. Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine. International Conference on Robotics and Automation (ICRA), 2018.

Reinforcement Learning with Deep Energy-Based Policies. Tuomas Haarnoja*, Haoran Tang*, Pieter Abbeel, and Sergey Levine. International Conference on Machine Learning (ICML), 2017.

If Softlearning helps you in your academic research, you are encouraged to cite our paper. Here is an example bibtex:

@techreport{haarnoja2018sacapps,
  title={Soft Actor-Critic Algorithms and Applications},
  author={Tuomas Haarnoja and Aurick Zhou and Kristian Hartikainen and George Tucker and Sehoon Ha and Jie Tan and Vikash Kumar and Henry Zhu and Abhishek Gupta and Pieter Abbeel and Sergey Levine},
  journal={arXiv preprint arXiv:1812.05905},
  year={2018}
}
