UC Berkeley and Google AI researchers present “Director”: a reinforcement learning agent that learns hierarchical behaviors from pixels by planning in the latent space of a learned world model

Researchers from UC Berkeley and Google AI present “Director”: a reinforcement learning agent that learns hierarchical behaviors from pixels by planning in the latent space of a learned world model. The world model that Director builds from pixels enables efficient planning in latent space: it maps images to compact model states and predicts future model states given future actions. Director optimizes two policies over these imagined trajectories of model states: every fixed number of steps, the manager selects a new goal, and the worker learns to reach the goals through low-level actions. Choosing goals directly in the high-dimensional continuous representation space of the world model would pose a difficult control problem for the manager. Instead, Director learns a goal autoencoder that compresses model states into smaller discrete codes. The manager selects among these discrete codes, and the goal autoencoder decodes them back into model states that are passed to the worker as goals.
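
To make this two-level structure concrete, here is a minimal Python sketch of how the pieces could interact. The encoder, manager policy, goal decoder, and worker policy are stubbed out with hypothetical placeholder functions, and all names and dimensions are illustrative assumptions rather than the paper's code.

```python
import numpy as np

STATE_DIM, CODE_DIM, GOAL_EVERY = 32, 8, 16  # illustrative sizes, not from the paper

def encode_image(image):
    # World model encoder stub: flatten the image and squash it into a latent model state.
    return np.tanh(image.reshape(-1)[:STATE_DIM])

def manager_policy(state):
    # Manager stub: pick a discrete code (random here; learned in Director).
    return np.random.randint(0, 2, size=CODE_DIM).astype(float)

def goal_decoder(code):
    # Goal autoencoder decoder stub: map the discrete code back to a model state
    # (a fixed random projection stands in for the learned decoder).
    rng = np.random.default_rng(0)
    return np.tanh(code @ rng.standard_normal((CODE_DIM, STATE_DIM)))

def worker_policy(state, goal):
    # Worker stub: a low-level action nudging the state toward the goal.
    return np.clip(goal[:4] - state[:4], -1.0, 1.0)

state = encode_image(np.random.rand(64, 64, 3))
goal = None
for step in range(64):
    if step % GOAL_EVERY == 0:           # the manager acts only every K steps
        goal = goal_decoder(manager_policy(state))
    action = worker_policy(state, goal)  # the worker acts at every step
    # An environment step and world-model/policy updates would follow here.
```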

Advances in deep reinforcement learning have accelerated the study of decision-making in artificial agents. Unlike generative ML models such as GPT-3 and Imagen, artificial agents can actively affect their environment, for example by moving a robot arm based on camera inputs or by clicking a button in a web browser. Although artificial agents hold the potential to increasingly help people, current approaches are held back by the need for detailed feedback in the form of frequently provided rewards to learn successful strategies. For example, despite access to massive computing resources, even powerful programs such as AlphaGo only have to take a limited number of moves before receiving their next reward.

In contrast, complex activities such as preparing a meal require decision-making at all levels of abstraction, from planning the menu, to navigating to the store to purchase groceries, to correctly executing the fine motor skills required at each step along the way based on high-dimensional sensory inputs. Hierarchical reinforcement learning (HRL) aims to automatically break such complicated tasks into achievable subgoals, allowing artificial agents to solve tasks more autonomously from sparser feedback. However, HRL has remained a research challenge because no general solution exists, and existing approaches rely on manually specified goal spaces or subtasks.

Director learns a manager policy that proposes subgoals within the latent space of a learned world model, and a worker policy that learns to achieve these subgoals. Although both policies operate on latent representations, Director's decisions can be inspected and analyzed by decoding its internal subgoals into images. They evaluate Director across a range of benchmarks, showing that it learns diverse hierarchical strategies and solves tasks with very sparse rewards where prior methods fail, such as controlling quadruped robots to navigate 3D mazes directly from first-person pixel inputs.

Director learns to solve complex long-horizon tasks by automatically breaking them down into subgoals. Each panel shows the interaction with the environment on the left and the decoded internal goals on the right.

Since Director optimizes all of its components simultaneously, the manager learns to choose goals that the worker can actually achieve. The manager selects goals to maximize both the task reward and an exploration bonus, encouraging the agent to explore and steer toward remote parts of the environment. They find that a simple and effective exploration bonus is to favor model states for which the goal autoencoder incurs a high prediction error. Unlike prior approaches such as Feudal Networks, the worker is trained purely from a feature-space similarity between the current model state and the goal.
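
As a rough illustration of the two reward signals described above, the snippet below sketches plausible forms for the manager's and worker's objectives. The exact definitions, including the cosine similarity used for the worker and the bonus weighting, are assumptions made for readability rather than the paper's precise formulas.

```python
import numpy as np

def exploration_bonus(state, reconstructed_state):
    # Bonus is the goal autoencoder's reconstruction error for this model state:
    # states it cannot yet compress well are treated as novel.
    return np.mean((state - reconstructed_state) ** 2)

def manager_reward(task_reward, state, reconstructed_state, expl_weight=0.1):
    # Manager maximizes the task reward plus a weighted exploration bonus.
    return task_reward + expl_weight * exploration_bonus(state, reconstructed_state)

def worker_reward(state, goal, eps=1e-8):
    # Worker maximizes a feature-space similarity between the current model state
    # and the goal; cosine similarity is one plausible choice for illustration.
    return float(state @ goal / (np.linalg.norm(state) * np.linalg.norm(goal) + eps))

# Tiny usage example with random vectors.
s, g = np.random.randn(32), np.random.randn(32)
s_rec = s + 0.1 * np.random.randn(32)
print(manager_reward(task_reward=0.0, state=s, reconstructed_state=s_rec))
print(worker_reward(s, g))
```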

Benchmark results

Director operates in the end-to-end RL setting, unlike prior HRL work that frequently relied on tailored evaluation protocols, such as assuming diverse practice goals, access to the agent's global position on a 2D map, or ground-truth distance rewards. They propose the challenging Egocentric Ant Maze benchmark to evaluate the ability to explore and solve long-horizon tasks. In this suite of tasks, the joints of a quadruped robot must be controlled using only proprioceptive and first-person camera inputs to find and reach goals in 3D mazes. The sparse reward is given only when the robot reaches the goal, so agents must explore on their own in the absence of task rewards for most of their learning.

They compare Director to two state-of-the-art world-model-based algorithms: Plan2Explore, which maximizes both the task reward and an exploration bonus based on ensemble disagreement, and Dreamer, which simply maximizes the task reward. Both baselines learn non-hierarchical policies from imagined world-model trajectories. Plan2Explore produces noisy movements that flip the robot onto its back, preventing it from reaching the goal. Dreamer successfully completes the shortest maze but fails to navigate the larger ones. Director is the only method that consistently finds and reaches the goal in the larger mazes.

They propose the Visual Pin Pad suite to study how agents discover very sparse rewards in isolation from the difficulty of learning representations of 3D environments. In these tasks, the agent moves a black square around the workspace to step on pads of different colors. The history of previously activated pads is shown at the bottom of the screen, so long-term memory is not required. The agent must discover the correct sequence in which to activate all the pads in order to receive the sparse reward. Once again, Director substantially outperforms prior methods.

The Visual Pin Pad benchmark allows researchers to evaluate agents under very sparse rewards without confounding challenges such as long-term memory or 3D perception.

In addition to solving tasks with sparse rewards, they examine Director's performance on a range of tasks commonly used in the literature, which typically do not require long-horizon exploration. Their experiments cover 12 tasks spanning DMLab maze environments, Control Suite tasks, Atari games, and the Crafter research platform. The fact that Director solves all of these tasks with the same hyperparameters demonstrates the robustness of the hierarchy learning process. Additionally, providing the worker with the task reward lets Director learn the precise movements required for the task, fully matching or exceeding the performance of the state-of-the-art Dreamer agent.

Goal visualizations

The latent model states that Director uses as goals can be decoded into interpretable images with the learned world model. To better understand how Director makes decisions, they visualize its internal goals across several environments and find that it uses a variety of strategies to break down long-horizon tasks. For example, on the Walker and Humanoid tasks, the manager requests a forward-leaning pose and shifting floor patterns, and the worker fills in the details of how the legs need to move. In the Egocentric Ant Maze, the manager steers the robot ant by requesting a sequence of different wall colors. In DMLab mazes, the manager encourages the worker via the teleport animation that occurs right after collecting the desired object. In the 2D Crafter research platform, the manager requests the collection of resources and tools via the inventory display at the bottom of the screen.
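
A small sketch of how such goal visualizations could be produced is shown below. Here, `decode_image` is a hypothetical stand-in for the world model's image decoder, with a random projection used only to keep the example runnable; it is not the paper's decoder.

```python
import numpy as np

def decode_image(model_state, height=64, width=64):
    # Stub decoder: projects the latent goal state to pixel space with a fixed
    # random matrix and squashes values to [0, 1], just to make the pipeline runnable.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((model_state.size, height * width * 3))
    img = 1.0 / (1.0 + np.exp(-(model_state @ proj)))
    return img.reshape(height, width, 3)

goal_state = np.random.randn(32)       # a latent goal chosen by the manager
goal_image = decode_image(goal_state)  # human-interpretable visualization of the goal
print(goal_image.shape)                # (64, 64, 3)
```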

Left: In Egocentric Ant Maze XL, the manager guides the worker through the maze by targeting walls of different colors. Right: In Visual Pin Pad Six, the manager specifies subgoals by highlighting different pads and via the history display at the bottom.

Left: In Walker, the manager requests a forward-leaning pose with both feet off the ground and a shifting floor pattern, and the worker fills in the details of the leg movement. Right: In the challenging Humanoid task, Director learns to stand up and walk reliably from pixels without early episode terminations.

Future directions

They see Director as a step forward in HRL research and are preparing to release its source code. Director offers the research community a practical starting point for building future hierarchical agents, for example by allowing goals to correspond to only parts of the representation vectors, dynamically learning goal durations, and building hierarchies with three or more levels of temporal abstraction. They are optimistic that future algorithmic advances in HRL will enable intelligent agents to reach new levels of performance and autonomy.

This article is a summary written by Marktechpost Staff based on the research paper 'Deep Hierarchical Planning from Pixels'. All credit for this research goes to the researchers on this project. Check out the paper and project.


James G. Williams