Toward Competent Robot Apprentices: Enabling Proactive Troubleshooting in Collaborative Robots.

Christopher Thierauf, Theresa Law, Tyler Frasca, Matthias Scheutz. MDPI Machines 2024.

How can robots use dialog to explain, understand, and resolve failure?

You can read the full paper here.

Robots fail. Sensors get occluded, grippers jam, power browns out. The usual pipeline of parsing a command, planning, then executing doesn’t help much once reality diverges from the plan. In this paper I built a “robot apprentice” that treats failure as a signal to reason, explain, and adapt in real time alongside a human partner. The system detects when its performance drifts from expectation, infers what’s likely broken, and either proposes or executes an alternative that still achieves the goal. As it goes, it explains its reasoning.

What “apprentice” means here

The goal here is a robot collaborator that knows its limits, knows what can be done about them, and can communicate both. That requires a few moving parts:

  • Self-assessment: It tracks action outcomes over time and computes success likelihoods, not just at the single-action level but conditioned on short “windows” of recent behavior (contexts); see the sketch after this list.
  • Costed planning over success probabilities: Plans are chosen by predicted probability of success, not just by length.
  • Dialogue integrated with planning: The robot can answer “What’s the chance you can do X?”, justify refusals, and propose alternatives that preserve intent.
  • Context inference: When observed outcomes become statistically inconsistent with “normal,” it shifts to a context that better explains the data (e.g., gripper jammed), which updates both plan choice and what it tells the human.
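
To make the self-assessment piece concrete, here is a minimal sketch in Python of what context-conditioned success tracking could look like. The class name, window size, and Laplace smoothing are my own illustrative assumptions, not the paper’s implementation (which lives inside DIARC):

```python
from collections import defaultdict, deque

class SuccessModel:
    """Running success statistics per action, conditioned on a context label
    and a short window of recently executed actions. (Hypothetical sketch.)"""

    def __init__(self, window_size=2):
        self.recent = deque(maxlen=window_size)    # short window of recent actions
        self.counts = defaultdict(lambda: [0, 0])  # (context, window, action) -> [successes, attempts]

    def record(self, action, context, succeeded):
        """Update the running model after executing `action` under `context`."""
        key = (context, tuple(self.recent), action)
        self.counts[key][0] += int(succeeded)
        self.counts[key][1] += 1
        self.recent.append(action)

    def p_success(self, action, context):
        """Laplace-smoothed success estimate for `action` given the current
        context and recent-action window (unseen pairs default to 0.5)."""
        succ, attempts = self.counts[(context, tuple(self.recent), action)]
        return (succ + 1) / (attempts + 2)
```

Recording outcomes as they happen keeps these estimates current, which is what lets the later stages notice when reality drifts from “normal.”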

How it works (nuts and bolts)

  • Perception & language feed a symbolic state. We’re using the DIARC cognitive robotic architecture here.
  • Action library has preconditions/effects; we maintain running success models for each action (and for short predecessor windows).
  • Plan scoring: For a candidate plan π = ⟨a₁…aₙ⟩, we estimate P(π) = ∏ᵢ P(aᵢ succeeds ∣ context) and pick the plan with the highest probability.
  • Dialogue reflects that same model: “Probability I can fetch the gearbox top?” If the context implies closeGripper ≈ 0, the answer is 0 and the system proposes scoop instead. The point is consistency: what we do and what we say come from the same beliefs. A short sketch of both pieces follows this list.
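
Here is a sketch of plan scoring and of answering a probability query from the same numbers, assuming the hypothetical SuccessModel above (again, illustrative names, not the system’s actual API):

```python
from math import prod

def score_plan(plan, model, context):
    """P(plan) = product over actions of P(success | context, recent window).
    For simplicity, every action is scored against the current window rather
    than its own predecessors within the plan."""
    return prod(model.p_success(a, context) for a in plan)

def best_plan(candidate_plans, model, context):
    """Pick the candidate with the highest predicted probability of success."""
    return max(candidate_plans, key=lambda p: score_plan(p, model, context))

def answer_probability_query(goal_plan, model, context):
    """The spoken answer is the same number the planner uses to rank the plan."""
    return f"My estimated chance of doing that is {score_plan(goal_plan, model, context):.0%}."

# Example: if closeGripper is ~0 under the current context, the pick-up plan
# scores ~0 and the scoop-based plan wins.
pick_up = ["approach", "openGripper", "closeGripper", "lift"]
scoop   = ["approach", "scoop", "lift"]
# chosen = best_plan([pick_up, scoop], model, context="gripper_jammed")
```

Because the dialogue answer is computed from the exact quantity the planner maximizes, the robot cannot say one thing and do another.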

Case studies on a Fetch robot

  • Power loss: After an e-stop (motors off, sensing on), the robot’s observed outcomes diverge from normal. It infers a motion-can’t-execute context and answers probability queries accordingly (e.g., “approach” ≈ 0), rather than blindly trying.
  • Jammed gripper: Close fails; context shifts to gripper jammed; the system substitutes scoop for pick up, explains the substitution, and completes the task.
  • Partial failures: With a small jam the robot can still grasp large items but not small ones; it picks the large gear to satisfy “pick a gear,” because that plan has higher success under the current context. This is the point of context-conditioned planning: don’t disable grasp wholesale; use it where it still works. A sketch of the underlying context inference follows below.
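
One plausible reading of the context-inference step, sketched under my own assumptions rather than as the paper’s exact statistical test: score recent outcomes under each known context and switch only when some alternative explains them substantially better than “normal.”

```python
from math import log

def log_likelihood(observations, context, model):
    """Log-likelihood of recent (action, succeeded) outcomes under a context,
    using the same per-context success estimates the planner uses."""
    total = 0.0
    for action, succeeded in observations:
        p = model.p_success(action, context)
        total += log(p if succeeded else 1.0 - p)
    return total

def infer_context(observations, contexts, model, current="normal", margin=log(20)):
    """Stay in the current context unless another explains the recent outcomes
    much better (here: at least 20x more likely, an arbitrary threshold)."""
    scores = {c: log_likelihood(observations, c, model) for c in contexts}
    best = max(scores, key=scores.get)
    return best if scores[best] - scores[current] > margin else current

# Example: repeated closeGripper failures are far more likely under
# "gripper_jammed" than under "normal", so the inferred context shifts,
# and planning then prefers scoop (or the large gear) under that context.
```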

Does anyone actually prefer this behavior?

We ran an online study with five robot behaviors:

  • A: fails silently.
  • B: fails and merely says it failed.
  • C: fails and explains why (e.g., “gripper jammed”).
  • D: warns it can’t pick up; waits for a suggested alternative.
  • E: warns and proactively executes a viable alternative (e.g., “I can’t pick up, but I can scoop,” then does it).

The goal here was to measure human-robot trust, because trust metrics are a sound indicator of someone’s future willingness to interact with the robot. These five conditions test whether people prefer robots that are simply obedient or robots that can collaborate (even if that means not following the instructed plan).

Trust scores (7-point Likert) were significantly higher for C/D/E than for A/B. Rankings were brutal but predictable: E was most preferred across ease of operation, proactivity, interaction, and understanding; D was second; C third; B fourth; A last. Translation: people prefer robots that (1) avoid predictable failures, (2) explain themselves, and (3) take initiative to salvage the goal.

What this isn’t

It’s not “LLM as planner”: there’s no LLM involved at any stage. The system stays symbolic where it matters (preconditions/effects, goals, explanations) and uses statistics only to reweight choices and dialogue. That makes it predictable, debuggable, and compatible with established task-planning stacks.

Why I think this matters

Competence beats charisma. If we want robots people actually rely on, they must (a) know when a plan will fail, (b) communicate that succinctly, and (c) propose something better—without waiting to be micromanaged. The apprentice architecture does exactly that: it aligns action selection and explanation under a shared model of expected performance, and it updates that model online when reality changes.