What Matters in Designing Resilient Systems

Christopher Thierauf, Matthias Scheutz, IEEE ERAS 2025.

How should our autonomous systems be designed around failure? We argue that we shouldn’t design around failure as a special case at all: failure should be treated as just another state to plan around, which means interpreting it and integrating it into the planning problem.

When a robot is working in your warehouse and something goes wrong, you can send a technician over. When a robot is sitting on the seafloor six kilometers deep, you can’t. AUV Sentry (the autonomous underwater vehicle owned and operated by NDSF and WHOI, primarily for oceanographic research) spends its dives entirely on its own, far beyond our reach. Communication doesn’t fill the gap, either: acoustic methods are available but extremely limited. And as we try to scale these systems up, human intervention will become even less viable. We need robust autonomy, and deep-sea autonomy is a stress test for the whole question of how robots should handle failure.

We argue that robots need to handle failure by understanding it and its impact. Once a robot understands what is going on and how it relates to its mission, it can try to resolve, repair, or recover.

Why Existing Taxonomies Fall Short

There’s already a useful body of work on classifying robot failures, but most of it stops at diagnostics. You’ll find taxonomies that split failures into “hardware” versus “software,” or that borrow from the human-factors world (for example: the Swiss Cheese model of accidents, or the SHERPA model of human error). Others are closer to robotics and carefully catalog the ways a platform’s subsystems can fail in the field.

The trouble is that these frameworks are written for the human technician reading the diagnostic report afterward, not for the robot that has to keep operating in the moment. They don’t explore what a robot can do, or what it can interpret, in the moment to make the most of the current event. A few do consider system response, but they tend to assume a human is in the loop, for example by designing in redundancy ahead of time or re-tuning behavior after the fact. That assumption breaks down the instant your robot is somewhere a human can’t reach.

There’s a second gap, too. Real failures are rarely a single clean chain of cause and effect. A power-regulation problem cascades into a string of sensor and actuator failures; a sensor that “fails at depth” might actually be misconfigured to power off on a timer. Most taxonomies treat failure as one cause leading to one effect, when in practice a single failure is tangled up with multiple states and conditions that all need to be accounted for.

We wanted something different: a taxonomy built for an autonomous, problem-solving agent, and one grounded in failures that actually happened rather than ones we imagined.

Learning from ~450 Real Dives

To do that, we went to the data. The National Deep Submergence Facility produces a “dive report” for every Sentry deployment. There are hundreds of these PDFs, covering dives that typically run 12 to 24 hours each. Each report contains structured statistics (launch coordinates, battery use, dive time) and, more valuably for us, a written narrative paragraph describing what the dive was supposed to do, what actually happened, why it ended, and anything else worth recording for future users.

Here’s a lightly trimmed sample of one of these narratives (from Sentry510):

Sentry’s primary objective was a stepped-altitude chemical survey of Cathedral Hill. […] The outboard methane sensor attached to the vehicle recovery hook stinger caused control issues at speeds over 0.6 m/s due to excessive, asymmetric drag. Sentry was able to complete the desired altitude levels, but the tracklines wander considerably. Following surveys will restrict forward speed to 0.6 m/s or slower.

That single paragraph contains a cause (a poorly-placed sensor creating drag), a condition (it only matters above 0.6 m/s), and an effect (the tracklines wander, but the mission still finishes). This is a rich dataset, but challenging to parse automatically.

Rather than hand-decoding every dive, we built a pipeline. A Python script parses the report PDFs and loads the statistics into a SQL database. The narratives go to large language models (we used both GPT-4o and Claude Opus 4.7 to reduce error, with human inspection of every output), prompted with a battery of structured questions: Did this mission succeed or fail? What caused the failure (hardware, software, environmental, human)? Was it constant or intermittent? What was its effect? The two models fully agreed on 264 of 387 multi-step categorizations (about 68%), and we resolved the remaining 123 by hand.
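As a rough illustration of the statistics half of that pipeline, here’s a minimal sketch using Python’s built-in sqlite3 module. The table name, column names, and sample values are hypothetical placeholders chosen for this post, not the actual schema or data behind the Sentry reports, and the PDF-parsing step is left out entirely.

```python
import sqlite3

# Hypothetical schema: these column names are illustrative assumptions,
# not the real layout of an NDSF dive report.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dive_stats (
    dive_id     TEXT PRIMARY KEY,
    launch_lat  REAL,
    launch_lon  REAL,
    battery_kwh REAL,
    dive_hours  REAL,
    narrative   TEXT
)
"""

def store_dive(conn, record):
    """Insert (or overwrite) the parsed fields of one dive report."""
    conn.execute(
        "INSERT OR REPLACE INTO dive_stats VALUES "
        "(:dive_id, :launch_lat, :launch_lon, :battery_kwh, :dive_hours, :narrative)",
        record,
    )

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Placeholder records for illustration only; not taken from any real dive.
store_dive(conn, {
    "dive_id": "example-001", "launch_lat": 0.0, "launch_lon": 0.0,
    "battery_kwh": 10.0, "dive_hours": 18.0,
    "narrative": "Placeholder narrative text.",
})
store_dive(conn, {
    "dive_id": "example-002", "launch_lat": 0.0, "launch_lon": 0.0,
    "battery_kwh": 8.0, "dive_hours": 12.0,
    "narrative": "Another placeholder narrative.",
})

# Aggregate statistics then become one-line queries.
total_hours = conn.execute("SELECT SUM(dive_hours) FROM dive_stats").fetchone()[0]
dive_count = conn.execute("SELECT COUNT(*) FROM dive_stats").fetchone()[0]
```

Keeping the narrative next to the structured fields means each language-model answer can later be joined back to the dive it describes.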

One detail from that process turned out to matter a lot. Almost every disagreement was of the same kind: GPT-4o would flag a dive as a “failure,” while Claude would correctly notice that the mission had succeeded despite the failure event. Failure events are not the same as mission failures, and that distinction is one of the central findings of the paper. All told, the dataset spans Sentry deployments 300 through 750: roughly 5,400 deployed hours and over 12,000 kilometers traveled.
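One way to make that distinction hard to miss is to record the two questions as separate fields and send any dive where the models differ to a human. The sketch below assumes a hypothetical `Label` record and invented dive IDs; it is not the reconciliation code we actually ran.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    failure_event: bool   # did anything go wrong during the dive?
    mission_failed: bool  # did the dive miss its primary objective?

def reconcile(labels_a, labels_b):
    """Split dives into those where both models agree and those needing review."""
    agreed, needs_review = {}, []
    for dive_id in sorted(labels_a.keys() & labels_b.keys()):
        if labels_a[dive_id] == labels_b[dive_id]:
            agreed[dive_id] = labels_a[dive_id]
        else:
            needs_review.append(dive_id)
    return agreed, needs_review

# Invented labels for two hypothetical dives.
model_a = {
    "dive_a": Label(failure_event=False, mission_failed=False),
    "dive_b": Label(failure_event=True, mission_failed=True),   # conflates the two
}
model_b = {
    "dive_a": Label(failure_event=False, mission_failed=False),
    "dive_b": Label(failure_event=True, mission_failed=False),  # event, but success
}

agreed, needs_review = reconcile(model_a, model_b)

# Agreement rate from the counts reported in the text: about 0.68.
agreement_rate = 264 / 387
```

Here `dive_b` lands in `needs_review` because the models agree that something went wrong but disagree about whether the mission failed, which is the pattern described above.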

A Three-Dimensional Taxonomy: Cause, Condition, Effect

The framework we landed on describes any failure event along three axes. These axes are independent. The point of keeping them separate is that each axis answers a different question, and linking them is what lets a planner actually reason about the failure.

Cause: what went wrong at the root. We sort causes into four origins:

Hardware failures: mechanical or electrical, like a sensor that floods or an actuator that stops responding.
Software failures: bugs, crashes, memory overflows, but also subtler things like a correctly-running component that simply isn’t passing critical information to another.
Environmental disruptions: currents that push the vehicle off its planned path, weather that forces a change of plan, contact with the seabed.
Human errors: mis-specified survey points, mis-interpreted instructions, logistical constraints. Notably, unlike human-robot-interaction taxonomies, we only care about humans as a source of error, not about how the failure is perceived by a human.

Condition: how the failure relates to the robot’s symbolic state. This is the most distinctive axis, and the one a single failure can hit in multiple ways at once:

Frequency: is the fault constant within some interval, or intermittent? (A thruster constantly kicking up sediment versus two sonars that interfere only on a particular clock cycle.)
State: which facts about the robot or environment are tied to the failure, ideally ones already encoded in the system’s domain model so that it can reason about them.
Relevance: does the relevant state occur before, during, or after the failure? This is what lets you start inferring causality: a power surge that precedes a controller failure, versus two systems that interfere while running simultaneously.
Traceability: does a correlation actually reflect causation? A sensor that “fails with depth” might really be misconfigured to shut off on a timer. Telling these apart points to completely different fixes.
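The relevance question in particular reduces to a comparison of timestamps. As a minimal sketch, assuming we have a start and end time for the state and a single time for the failure (all hypothetical values, in seconds since launch):

```python
from enum import Enum

class Relevance(Enum):
    BEFORE = "before"
    DURING = "during"
    AFTER = "after"

def classify_relevance(state_start, state_end, failure_time):
    """Place a state's time interval relative to the moment of failure."""
    if state_end < failure_time:
        return Relevance.BEFORE
    if state_start > failure_time:
        return Relevance.AFTER
    return Relevance.DURING

# Hypothetical timings: a power surge that ends before a controller failure...
surge = classify_relevance(state_start=100.0, state_end=105.0, failure_time=110.0)

# ...versus two sonars that are both running when the interference shows up.
sonars = classify_relevance(state_start=200.0, state_end=300.0, failure_time=250.0)
```

Note that this only orders events in time. It says nothing about traceability: a state that precedes a failure is a candidate cause, not a confirmed one.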

Effect: what the failure does to the mission. Three categories, roughly in order of severity:

Abortive: a critical system is gone and the mission can’t be recovered, e.g., a navigation sensor failing so badly that position can’t be corrected from the surface.
Deviative: behavior departs from what was expected (veering off path, missing a target) but is often recoverable, like re-tuning a control policy to compensate for a weak thruster.
Degradative: nothing visibly breaks yet, but there are longer-term consequences, such as a badly-tuned PID controller straining the hardware, or poor line-following quietly draining the battery in a way that bites you at the end of the dive.

The value in this is the way each axis forces the system to ground failure in observable facts that can be linked and reasoned over. Once you can say “this cause, under these conditions, produces this effect,” you have something a task planner can use.
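To make the three axes concrete, here is one way they could be written down as a data structure. This is a sketch under our own assumptions: the class and field names are hypothetical, and the encoding of the Sentry510 narrative is one reading of that report (in particular, filing the badly-placed sensor under “hardware” and calling the drag “constant” are illustrative choices, not labels taken from the dataset).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Cause(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"
    ENVIRONMENTAL = "environmental"
    HUMAN = "human"

class Frequency(Enum):
    CONSTANT = "constant"
    INTERMITTENT = "intermittent"

class Relevance(Enum):
    BEFORE = "before"
    DURING = "during"
    AFTER = "after"

class Effect(Enum):
    ABORTIVE = "abortive"
    DEVIATIVE = "deviative"
    DEGRADATIVE = "degradative"

@dataclass(frozen=True)
class Condition:
    state: str                         # symbolic fact tied to the failure
    frequency: Frequency
    relevance: Relevance
    traceable: Optional[bool] = None   # None means causality is not yet established

@dataclass(frozen=True)
class FailureEvent:
    cause: Cause
    conditions: Tuple[Condition, ...]  # one failure can carry several conditions
    effect: Effect

    def is_mission_failure(self) -> bool:
        """Only abortive effects end the mission; the others are failure events."""
        return self.effect is Effect.ABORTIVE

# Illustrative encoding of the Sentry510 narrative quoted earlier.
sentry510 = FailureEvent(
    cause=Cause.HARDWARE,  # assumption: sensor placement treated as hardware
    conditions=(
        Condition(
            state="forward_speed > 0.6 m/s",
            frequency=Frequency.CONSTANT,  # assumption: present whenever the state holds
            relevance=Relevance.DURING,
        ),
    ),
    effect=Effect.DEVIATIVE,  # tracklines wander, but the survey completes
)
```

Because `conditions` is a tuple, a single event can carry several conditions at once, and `is_mission_failure` keeps the difference between a failure event and a failed mission explicit.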

What the Data Showed

Running the whole Sentry dataset through this lens produced a few results worth sitting with. Of 445 reviewed deployments:

57% had no issues at all, before, during, or after.
38% deviated from the plan but still completed the mission.
Only 5% aborted mid-dive.

Put differently: about 43% of deployments contained at least one failure event, yet roughly 95% still completed their primary science objectives, typically because on-board autonomy or an operator caught the problem in time. Failure events are common while mission failures are rare. That gap is exactly the empirical case for building autonomy that treats failure as something to manage rather than something that ends the dive.
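The arithmetic behind those two headline numbers is simple enough to check directly. The per-category counts below are reconstructed from the rounded percentages, so treat them as approximations rather than exact figures from the dataset.

```python
reviewed = 445

# Rounded percentages as reported above.
pct = {"no_issues": 57, "deviated_but_completed": 38, "aborted": 5}

# Any dive that deviated or aborted contained at least one failure event.
pct_with_failure_event = pct["deviated_but_completed"] + pct["aborted"]

# Any dive that did not abort still completed its mission.
pct_completed = pct["no_issues"] + pct["deviated_but_completed"]

# Approximate counts, reconstructed from rounded percentages.
approx_counts = {name: round(reviewed * p / 100) for name, p in pct.items()}
```

That works out to 43% of dives with a failure event against 95% completed, or roughly 254 clean dives, 169 that deviated, and 22 that aborted.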

The other finding is about cause distribution: no single category dominates. Hardware, software, environmental, and human origins are all roughly equally represented. A monitoring system that only watches for hardware faults would quietly miss a big chunk of what actually goes wrong. And in more than one in five cases, the fault was constant (the kind that can sink a mission unless the architecture can anticipate and avoid it).

There’s a limitation here that motivates ongoing and future work: traceability, relevance, and other temporal conditions are hard to determine from static reports written after the fact, and so is connecting them to the raw sensor readings that would let us detect them. Establishing real causality usually needs live debugging access the dive reports don’t have, which is itself an argument for giving the platform better real-time fault monitoring.

Why This Matters

Tie the three axes together and you get the raw material for planning. If a robot knows that enabling a particular power channel (cause) consistently overwhelms the rest of the electrical system (condition) and knocks out its sensors (effect), it can simply plan never to enter that state again. Failure avoidance becomes a planning operation, not a hand-written exception.
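Here is a toy version of that idea: a breadth-first planner over symbolic states that refuses to enter any state containing a fact known to trigger a failure. Everything in it is hypothetical, including the fact names, the actions, and the backup route; it is a sketch of the principle, not the planner or domain model used on Sentry.

```python
from collections import deque
from typing import NamedTuple

class Action(NamedTuple):
    name: str
    pre: frozenset     # facts that must hold before the action
    add: frozenset     # facts the action makes true
    delete: frozenset  # facts the action makes false

def plan(start, goal, actions, forbidden=frozenset()):
    """Breadth-first search that never enters a state containing a forbidden fact."""
    start, goal = frozenset(start), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action in actions:
            if not action.pre <= state:
                continue
            nxt = (state - action.delete) | action.add
            if nxt & forbidden or nxt in seen:
                continue
            seen.add(nxt)
            queue.append((nxt, steps + [action.name]))
    return None  # no plan avoids the forbidden facts

none = frozenset()
# Hypothetical domain: a fast route through the power channel, a slower backup.
actions = [
    Action("enable_channel", none, frozenset({"channel_on"}), none),
    Action("survey_high_power", frozenset({"channel_on", "sensors_ok"}),
           frozenset({"surveyed"}), none),
    Action("switch_to_backup_bus", none, frozenset({"backup_on"}), none),
    Action("warm_up_backup", frozenset({"backup_on"}),
           frozenset({"backup_ready"}), none),
    Action("survey_on_backup", frozenset({"backup_ready", "sensors_ok"}),
           frozenset({"surveyed"}), none),
]

# Before the failure is understood, the shortest plan goes through the channel.
naive = plan({"sensors_ok"}, {"surveyed"}, actions)

# Once "channel_on" is known to knock out the sensors, forbid that state.
informed = plan({"sensors_ok"}, {"surveyed"}, actions,
                forbidden=frozenset({"channel_on"}))
```

The only change between the two calls is the `forbidden` set. No recovery routine was written for the power-channel fault: the planner finds the longer backup route on its own.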

It also opens the door to more useful and informative explainability: “I can’t complete this task because it requires the power channel to be on, but turning it on prevents sensor use.” For a vehicle no one can observe directly, that kind of communication is how operators understand what the system is doing (and, after the dive, how they decide what to repair or improve next time).
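A message like that can fall straight out of the same symbolic knowledge. The sketch below assumes a hypothetical `FailureRule` record that pairs each symbolic fact with a human-readable phrase; real systems would need far richer language generation than string templates.

```python
from typing import NamedTuple, Optional, Sequence, Set

class FailureRule(NamedTuple):
    trigger: str        # symbolic fact that sets off the failure
    disables: str       # symbolic fact that stops holding as a result
    trigger_text: str   # how to describe the requirement to an operator
    action_text: str    # how to describe satisfying that requirement
    disables_text: str  # how to describe what is lost

def explain_infeasible(required: Set[str],
                       rules: Sequence[FailureRule]) -> Optional[str]:
    """Explain a task whose requirements include both a trigger and what it disables."""
    for rule in rules:
        if rule.trigger in required and rule.disables in required:
            return (
                "I can't complete this task because it requires "
                f"{rule.trigger_text}, but {rule.action_text} "
                f"prevents {rule.disables_text}."
            )
    return None  # no known failure rule makes the task self-defeating

# Hypothetical rule learned from the power-channel failure.
rules = [
    FailureRule(
        trigger="channel_on",
        disables="sensors_ok",
        trigger_text="the power channel to be on",
        action_text="turning it on",
        disables_text="sensor use",
    )
]

message = explain_infeasible({"channel_on", "sensors_ok"}, rules)
```

The explanation is only as good as the failure rules behind it, which is one more reason to ground those rules in observable facts.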

This is the substantial shift we’re arguing for. Instead of writing a custom, hard-coded response for every failure scenario you can imagine ahead of time, failure detection and recovery become part of the system’s ordinary planning and reasoning, using the domain model and planner it already has. The robot stops merely reacting to faults after they happen and, at least for the predictable and avoidable ones, starts heading them off (and, of course, can still respond to faults after they occur).

Looking Forward

The framework is grounded in AUV data, but nothing about it is specific to the ocean. It’s domain-general, and the same cause-condition-effect structure applies just as well to a drone, a warehouse robot, or a home assistant. The next step is to build the actual monitoring and reasoning processes the taxonomy implies and wire them into a robot architecture, so we can measure how much more reliable and how much more autonomous a system becomes when it treats its own failures as just another part of the plan.
