Distributed reinforcement learning--practical teaching platform.
[Objective] Reinforcement learning (RL) is a machine learning method that enables an agent to learn policies through interaction with its environment, showing great potential in a wide range of applications, particularly autonomous driving, robotic control, and intelligent non-player characters in computer games. Teaching RL to undergraduate students presents several challenges, including high algorithmic complexity, substantial computational resource requirements, and long training times. Many universities have only recently introduced AI majors and courses and frequently lack the computational resources essential for teaching RL. To address these challenges, this paper develops a distributed RL teaching platform designed to require minimal computational resources. The platform can be deployed in a conventional programming laboratory with one instructor computer equipped with an Intel Core i7-7700K CPU and an NVIDIA GTX 1070 Ti GPU, along with 100 student machines outfitted only with Intel Core i5-6500 CPUs. The platform supports RL teaching at four levels: foundational algorithms, core algorithms, RL paradigms, and specific applications.

[Methods] On the software side, the platform uses Python as the primary development language, leveraging the Gym and OpenCV-Python libraries for task-environment development. It standardizes the interface for states and actions, integrates sampling interfaces with training data, and implements interactive sampling. The platform supports a wide range of algorithms, including fundamental methods such as Q-learning and state-action-reward-state-action (SARSA), as well as deep RL methods such as the deep Q-network (DQN), proximal policy optimization (PPO), and AlphaZero. Building on its interactive sampling capability, the platform supports both the RL and inverse RL paradigms. PyTorch is used as the deep learning training framework, and TensorBoardX serves as the visualization tool. For training management and data exchange, the program files for the task environment and the RL method (.py files), together with the .bat file that launches a task, are prepared and distributed to student computers via Lanstar before training; remote commands are then sent through Lanstar to execute the .bat files and start training. During training, all student computers repeatedly execute the task with the newest policy. The "state-action-reward" data are continuously collected and transmitted to the teacher's computer via FTP. Once enough sampling data has been gathered, the teacher's computer updates the policy model and redistributes it to the student computers via FTP. To address the significant time required for model distribution, a hierarchical distribution structure is designed to speed up this process.

[Results] In comparison experiments, we evaluate the platform against a workstation equipped with dual NVIDIA GTX 1080 Ti GPUs on three representative tasks: Grid Box/DQN, Flappy Bird/PPO, and Go/AlphaZero. The results show that the platform reduces the training time of the first two tasks to one-eighth of that required by the workstation, and reduces the training time of the third task to one-third.

[Conclusion] The platform enables students to practice RL algorithms with limited computing resources, helps them understand the action-reward learning mechanism and the various learning paradigms, and improves the quality and efficiency of their learning.
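
To make the distributed training scheme described in [Methods] more concrete, the following is a minimal, hypothetical sketch of the student-side sampling loop, assuming a Gym-style task environment, a PyTorch policy that outputs action values, and an FTP server running on the teacher's computer. All names here (SERVER_IP, POLICY_PATH, SAMPLE_DIR, download_policy, run_episode, upload_samples) and the CartPole-v1 environment are illustrative assumptions, not details taken from the paper.

    # Hypothetical sketch of the student-side sampling loop; not the paper's actual code.
    import ftplib
    import io
    import pickle

    import gym
    import torch

    SERVER_IP = "192.168.1.1"         # teacher's computer (assumed lab address)
    POLICY_PATH = "policy/latest.pt"  # newest policy model pushed by the teacher (assumed path)
    SAMPLE_DIR = "samples"            # upload directory for student rollouts (assumed path)


    def download_policy(ftp: ftplib.FTP) -> torch.nn.Module:
        """Fetch the newest policy model distributed by the teacher's computer."""
        buf = io.BytesIO()
        ftp.retrbinary(f"RETR {POLICY_PATH}", buf.write)
        buf.seek(0)
        return torch.load(buf, map_location="cpu")


    def run_episode(env: gym.Env, policy: torch.nn.Module) -> list:
        """Collect one episode of (state, action, reward) tuples on the student CPU."""
        samples, state, done = [], env.reset(), False   # older Gym API: reset() returns obs only
        while not done:
            with torch.no_grad():
                action = policy(torch.as_tensor(state, dtype=torch.float32)).argmax().item()
            next_state, reward, done, _ = env.step(action)  # older Gym API: 4-tuple step()
            samples.append((state, action, reward))
            state = next_state
        return samples


    def upload_samples(ftp: ftplib.FTP, samples: list, worker_id: int, step: int) -> None:
        """Send the collected rollout to the teacher's computer for training."""
        buf = io.BytesIO(pickle.dumps(samples))
        ftp.storbinary(f"STOR {SAMPLE_DIR}/worker{worker_id}_{step}.pkl", buf)


    if __name__ == "__main__":
        env = gym.make("CartPole-v1")  # placeholder task; the paper uses its own Gym environments
        with ftplib.FTP(SERVER_IP) as ftp:
            ftp.login()  # anonymous login assumed for the closed lab network
            for step in range(1000):
                policy = download_policy(ftp)        # pull the newest policy
                samples = run_episode(env, policy)   # interactive sampling with that policy
                upload_samples(ftp, samples, worker_id=1, step=step)

In this sketch, each student machine only runs inference and environment steps on the CPU, while all gradient updates happen on the teacher's GPU machine, which matches the division of labor described in the abstract; the hierarchical model-distribution structure mentioned there is not shown.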