Christopher Thierauf, Matthias Scheutz. IEEE IROS 2024.
We can redesign the typical reinforcement learning pipeline to train faster and integrate with symbolic plans by training on the environment directly, not the robot within it.
You can read the paper here.
Robots are good at following plans until something goes wrong. Symbolic planners excel at long-horizon reasoning (“open the drawer, then grab the block”), but they’re often brittle. If the plan misses a detail or reality doesn’t line up, execution stalls. Reinforcement learning (RL) can adapt on the fly, but it’s usually slow, opaque, and tied to a specific robot’s body.
This paper is about bridging the two as a neurosymbolic method: letting RL act as a creative problem-solver inside a symbolic framework, using a novel way to construct the reinforcement learning problem and interact with the environment.
The Core Idea: Object-Centric Action and Observation Spaces
Most RL policies act in joint space (what angles should my robot’s arm joints move to?). We’ve seen how remarkably effective that is at creating complex fine motor control policies. But it’s low-level, platform-specific, and hard to transfer. It’s particularly odd in basic grasping scenarios, where we’ve had good “traditional” solutions for decades: inverse kinematics, and the code implementing it, has been well understood for a long time. It seems strange to me that we keep re-creating it through reinforcement learning.
Instead, I flipped it around:
- The observation space of the reinforcement learning agent is the positions of all objects in the scene.
- The action space of the reinforcement learning agent is forces applied to each object.
- Symbolic actions are defined as object trajectories (“move the drawer 10 cm out,” “place the block on the table”), not motor commands.
- Symbolic states are defined in terms of objects’ positions, velocities, and constraints.
This means the RL agent doesn’t need to learn kinematics or how to push/pull. It only needs to learn what should happen to the environment. A separate mapping layer converts object actions back into robot motions.
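As a rough illustration, here is what an object-centric environment interface could look like. This is a minimal sketch assuming gymnasium and numpy, with toy point-mass dynamics standing in for a real physics simulator; the object names, shapes, and force scale are illustrative, not taken from the paper:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ObjectCentricEnv(gym.Env):
    """Observations are object positions; actions are forces applied to objects.

    Toy point-mass dynamics stand in for a real physics simulator here.
    """

    def __init__(self, object_names=("drawer", "block"), dt=0.05):
        self.object_names = list(object_names)
        self.dt = dt
        n = len(self.object_names)
        # Observation: the (x, y, z) position of every object in the scene.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(n, 3), dtype=np.float32)
        # Action: a 3D force applied to each object (not joint torques).
        self.action_space = spaces.Box(-1.0, 1.0, shape=(n, 3), dtype=np.float32)
        self.positions = np.zeros((n, 3), dtype=np.float32)
        self.velocities = np.zeros((n, 3), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.positions[:] = 0.0
        self.velocities[:] = 0.0
        return self.positions.copy(), {}

    def step(self, action):
        # Integrate forces with unit mass and simple damping; a real setup
        # would hand these forces to a physics engine instead.
        self.velocities = 0.9 * self.velocities + np.asarray(action, dtype=np.float32) * self.dt
        self.positions = self.positions + self.velocities * self.dt
        reward = 0.0          # the sparse reward comes from the symbolic goal check
        terminated = False
        return self.positions.copy(), reward, terminated, False, {}
```

The point is that nothing in the observation or action space mentions the robot, so nothing about the learning problem changes when the robot does.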
That gets us faster training, symbolically grounded policies, and easy transfer between robots (assuming they share the same symbolic grounding).
How It Fits with Symbolic Planning
Here’s the workflow:
- A symbolic planner tries to solve a task (using PDDL-style logic).
- If it hits a gap (in these demonstrations, an operator it doesn’t have), RL gets called in. The symbolic state we’re currently in becomes the starting point, and reaching the symbolic state we aim for provides a sparse binary reward.
- RL learns how to bridge that gap in object space.
- The new skill is brought back into the symbolic framework, with preconditions/effects so it can be reused later.
That makes RL a plan repair tool. Instead of throwing away a symbolic plan when it fails, the robot can patch it with a new learned behavior.
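Here is a minimal sketch of how the symbolic goal can double as the reward signal. I’m assuming the symbolic state is a set of grounded predicates; the predicate names and thresholds below are placeholders for the real grounding:

```python
# Symbolic states as sets of grounded predicates, e.g. ("open", "drawer").

def symbolic_state(positions, names):
    """Abstract continuous object positions into grounded predicates.

    The thresholds and predicates here are illustrative placeholders.
    """
    state = set()
    drawer_x = positions[names.index("drawer")][0]
    if drawer_x > 0.10:                      # drawer pulled out more than 10 cm
        state.add(("open", "drawer"))
    block_z = positions[names.index("block")][2]
    if abs(block_z) < 0.01:                  # block resting at table height
        state.add(("on", "block", "table"))
    return state

def sparse_reward(positions, names, goal):
    """Binary reward: 1 when every goal predicate holds, else 0."""
    return 1.0 if goal <= symbolic_state(positions, names) else 0.0

# Example: the step the planner is missing becomes the RL goal.
goal = {("open", "drawer"), ("on", "block", "table")}
```

Because the reward is just a check on symbolic predicates, the same machinery that tells the planner whether a state satisfies a goal also tells the RL agent whether it has succeeded.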
Why This Works Better
- Training is faster. The agent doesn’t waste time learning robot physics; it only learns what matters for the task.
- Policies stay symbolic. Because the rewards are defined in first-order logic, the results can be explained, reused, and integrated into higher-level plans.
- Transfer is easier. The same “open drawer” policy can be executed by a Kinova arm or a Fetch robot, because it’s defined in terms of the drawer, not the arm.
There are tradeoffs. You need both a symbolic domain and a physics model in simulation. And you still need robot-specific implementations of basic actions (like “move arm to grasp”). But the payoff is that you get creative, explainable problem-solving when plans break down.
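The mapping layer is the one robot-specific piece. A rough sketch of the idea, with hypothetical primitive names; on the real robots these would bottom out in MoveIt motion requests:

```python
class RobotBackend:
    """Robot-specific implementations of a small set of basic actions.

    Each robot (Kinova Gen3, Fetch, ...) supplies its own subclass; the
    object-space policy itself never changes.
    """

    def reach(self, object_name):
        """Move the end effector to the object (e.g. via a MoveIt pose goal)."""
        raise NotImplementedError

    def apply_trajectory(self, object_name, waypoints):
        """Drag the grasped object along the desired object-space path."""
        raise NotImplementedError

def execute_object_action(backend, object_name, waypoints):
    """Map an object-space action (a desired object trajectory) onto
    whatever the current robot has to do to realize it."""
    backend.reach(object_name)
    backend.apply_trajectory(object_name, waypoints)
```

Swapping the Kinova for the Fetch means swapping the backend, not retraining the policy.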
Experiments: Drawers and Blocks
I tested this idea with a deceptively simple task: open a drawer and pull out a block. This sounds easy, but it is hard for a traditional reinforcement learning agent because it involves several sequential steps.
- In simulation, RL learned the sequence of object interactions.
- On a Kinova Gen3, the learned policy was mapped into MoveIt-based actions: sweep the drawer open, grab the block, place it on the table.
- Then, with no retraining, the same policy transferred to a Fetch mobile manipulator, which executed the same symbolic behaviors.
We even added a language layer: using an LLM (Mixtral) to translate symbolic effects into natural language descriptions (“I will open the drawer,” “I will place the block”). That showed how these hybrid policies can be communicated to humans.
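For the language layer, here is a sketch of the kind of prompt construction involved. The template and the `query_llm` call are placeholders for however you reach Mixtral; none of this is verbatim from the paper:

```python
def describe_effects(effects):
    """Build a prompt asking an LLM to verbalize symbolic effects.

    `effects` is a list of grounded predicates, e.g. [("open", "drawer")].
    """
    facts = "; ".join(f"{pred}({', '.join(args)})" for pred, *args in effects)
    return (
        "Rewrite the following planned effects as short first-person "
        f"statements of intent, one per line:\n{facts}"
    )

prompt = describe_effects([("open", "drawer"), ("on", "block", "table")])
# response = query_llm(prompt)   # hypothetical call to a Mixtral endpoint
# Expected flavor of output: "I will open the drawer." / "I will place the block."
```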
Why This Matters
Robots in the real world can’t rely on brittle plans, nor can they afford to spend days training new policies every time something changes. By treating RL as a creative repair mechanism inside symbolic planning, and by working in object-centric action spaces, we get the best of both worlds.
