Google AI researchers propose a meta-algorithm, Jump-Start Reinforcement Learning (JSRL), which uses prior policies to create a learning curriculum that improves performance
This research summary is based on the paper 'Jump-Start Reinforcement Learning'.
In the field of artificial intelligence, reinforcement learning (RL) is a type of machine learning in which an agent is rewarded for desirable behaviors and penalized for undesirable ones. The agent perceives its environment and learns how to act by trial and error, a bit like getting feedback on what works. However, learning policies from scratch in environments with hard exploration problems is a major challenge in RL. Consider a task in which the agent must open a door: since the agent does not receive any intermediate reward, it cannot determine how close it is to achieving the goal. As a result, it has to explore the space randomly until the door happens to open. Given the length of the task and the level of precision required, this is highly unlikely.
Rather than exploring the state space randomly, the agent should use prior information when performing such a task. This prior knowledge helps the agent determine which states of the environment are desirable and should be investigated further. Offline data collected from human demonstrations, scripted policies, or other RL agents could be used to train a policy, which then initializes a new RL policy. When policies are represented by neural networks, this means copying the pre-trained policy's network into the new RL policy, so that the new RL policy starts out behaving like the pre-trained one. However, naively initializing a new RL policy like this frequently fails, especially for value-based RL approaches.
Google AI researchers have developed a meta-algorithm that uses a pre-existing policy to initialize any RL algorithm. Jump-Start Reinforcement Learning (JSRL) uses two policies to learn a task: a guide policy and an exploration policy. The exploration policy is an RL policy trained online from the agent's new experience in the environment. In contrast, the guide policy is any pre-existing policy, and it is not changed during online training. JSRL produces a learning curriculum by first rolling in the guide policy and then handing control to the self-improving exploration policy, yielding results comparable to or better than competing IL+RL approaches.
How did the researchers approach the problem?
The guide policy can take any form, for example:
- A scripted policy
- A policy trained with RL
- A live human demonstrator
The only conditions are that the guide policy be reasonable (that is, better than random exploration) and capable of selecting actions based on observations of the environment. Ideally, the guide policy would reach poor or mediocre performance in the environment and could not be improved further with additional fine-tuning. JSRL can then build on the progress of this guide policy to improve performance further.
The guide policy is deployed for a set number of steps at the start of each training episode to bring the agent closer to the goal states. After that, the exploration policy takes over and continues to act in the environment to reach these goals. The number of steps performed by the guide policy is gradually reduced as the exploration policy's performance improves, until the exploration policy takes over completely. This procedure generates a curriculum of starting states for the exploration policy, such that each stage of the curriculum only requires learning to reach the starting states of the previous stages.
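The hand-over described above can be illustrated with a small sketch. This is not the authors' implementation: the toy chain environment, the always-move-right guide script, the tabular Q-learning explorer, and all names and thresholds (`GOAL`, `HORIZON`, `QPolicy`, `train_jsrl`, `threshold`) are assumptions chosen only to make the idea runnable. In this sketch the explorer also learns from the transitions generated by the guide, which is a further simplifying assumption.

```python
import random
from collections import defaultdict

GOAL, HORIZON = 8, 8      # toy chain: start at state 0, reward only on reaching GOAL
ACTIONS = (-1, +1)

def guide_policy(state):
    """Fixed pre-existing policy (a hypothetical script that always moves right)."""
    return +1

class QPolicy:
    """Tabular epsilon-greedy Q-learning, standing in for the exploration policy."""
    def __init__(self, epsilon=0.2, alpha=0.5, gamma=0.95, seed=0):
        self.q = defaultdict(float)
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.rng = random.Random(seed)

    def act(self, state, greedy=False):
        if not greedy and self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s2, done):
        best_next = max(self.q[(s2, b)] for b in ACTIONS)
        target = r if done else r + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

def run_episode(explorer, guide_steps, learn=True):
    """The guide acts for the first `guide_steps` steps, then the explorer takes over."""
    state = 0
    for t in range(HORIZON):
        if t < guide_steps:
            action = guide_policy(state)
        else:
            action = explorer.act(state, greedy=not learn)
        next_state = max(0, state + action)
        done = next_state == GOAL
        reward = 1.0 if done else 0.0
        if learn:
            explorer.update(state, action, reward, next_state, done)
        state = next_state
        if done:
            return True
    return False

def train_jsrl(explorer, episodes_per_stage=200, threshold=0.9, eval_episodes=20):
    """Reduce the guide's share of each episode once the explorer succeeds often enough."""
    guide_steps, schedule = HORIZON - 1, []
    while guide_steps >= 0:
        for _ in range(episodes_per_stage):
            run_episode(explorer, guide_steps)
        wins = sum(run_episode(explorer, guide_steps, learn=False)
                   for _ in range(eval_episodes))
        success = wins / eval_episodes
        schedule.append((guide_steps, success))
        if success < threshold:
            break         # this sketch simply stops; a fuller loop would keep training
        guide_steps -= 1
    return schedule

if __name__ == "__main__":
    for steps, success in train_jsrl(QPolicy()):
        print(f"guide steps: {steps}, explorer success rate: {success:.2f}")
```

Each stage only asks the explorer to reach states from which it already knows how to succeed, which is the curriculum effect described above. With no guide at all, the untrained explorer in this toy setting never sees a reward.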
How does it compare to the IL+RL baselines?
Since JSRL can use a previously established policy to initialize RL, a natural comparison is to imitation and reinforcement learning (IL+RL) methods, which train on offline datasets before fine-tuning the pre-trained policies with new online experience. JSRL is evaluated against competing IL+RL approaches on the D4RL benchmark tasks. These tasks include simulated robotic control environments, along with datasets of offline data from human demonstrations, planners, and other learned policies.
For each experiment, the methods are trained on an offline dataset and then fine-tuned online. JSRL is compared with algorithms designed specifically for this setting, such as AWAC, IQL, CQL, and behavioral cloning. While JSRL can be used in conjunction with any initial guide policy or fine-tuning method, IQL is used here both as the pre-trained guide and for fine-tuning. Each transition is a tuple of the form (S, A, R, S’), which records the state the agent started in (S), the action the agent performed (A), the reward the agent received (R), and the state the agent ended up in (S’) after completing action A. JSRL seems to work well with as few as ten thousand offline transitions.
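As a small illustration of the transition format, an offline dataset can be thought of as a list of (S, A, R, S’) records. The integer states and actions and the helper `total_reward` below are assumptions made for this sketch; the benchmark datasets themselves use richer observation and action types.

```python
from typing import List, NamedTuple

class Transition(NamedTuple):
    state: int        # S: the state the agent started in
    action: int       # A: the action the agent performed
    reward: float     # R: the reward the agent received
    next_state: int   # S': the state reached after the action

def total_reward(dataset: List[Transition]) -> float:
    """Sum the rewards stored in an offline dataset."""
    return sum(t.reward for t in dataset)

# A hypothetical two-step offline dataset.
offline_dataset = [
    Transition(state=0, action=1, reward=0.0, next_state=1),
    Transition(state=1, action=1, reward=1.0, next_state=2),
]
```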
Vision-based robotic tasks:
Due to the curse of dimensionality, using offline data in complex tasks such as vision-based robotic manipulation is a challenge. The high dimensionality of both the continuous-control action space and the pixel-based state space creates scaling problems for IL+RL approaches in terms of the amount of data needed to learn good policies. To study how JSRL scales to such tasks, the researchers focus on two challenging simulated robotic manipulation tasks: indiscriminate grasping (i.e. lifting any object) and instance grasping (i.e. lifting a specific target object). The QT-Opt+JSRL combination improves faster than any other strategy while also reaching the highest success rate.
The researchers’ algorithm generates a learning curriculum by first rolling in a pre-existing guide policy and then handing control to a self-improving exploration policy. Since the exploration policy starts exploring from states closer to the goal, its task is greatly simplified. The influence of the guide policy decreases as the exploration policy improves, resulting in a competent RL policy. The team hopes to apply JSRL to problems like Sim2Real in the future and to explore how various guide policies can be used to teach RL agents.